Writing a pandas dataframe to a csv file and renaming on a for loop - pandas

I have a script that reads a SQL database into pandas data frames, which are then concatenated in a loop to form a single data frame. I need to write this combined data frame to a CSV file and name the file from a list of IDs.
I am using DataFrame.to_csv to write the file and os.rename to change the name.
for X, df in d.iteritems():
    newdf = pd.concat(d)
for X in newdf:
    export_csv = newdf.to_csv (r'/Users/uni/Desktop/corrindex+id/X.csv', index = False, header = None)
for X in NAMES:
    os.rename ('X.csv',X)
This is the code that concatenates the data frames together.
In the third loop, NAMES = 'rt35' but in the future this will be a list of similar names.
I expect to get a file named rt35.csv. However, I get either r.csv or X.csv, along with this error:
OSError: [Errno 2] No such file or directory
The files are writing correctly, the only issue is the name.

In your code, X sits inside a string literal, so Python treats it as the character X rather than as the variable. Insert the value with string formatting instead:
export_csv = newdf.to_csv (r'/Users/uni/Desktop/corrindex+id/{}.csv'.format(X), index = False, header = None)
The same applies here:
for X in NAMES:
    os.rename (X +'.csv',X)
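As a minimal sketch of the same idea, the file can be written under its final name directly, so no rename step is needed. Note also that iterating over the string 'rt35' yields single characters, which would explain the r.csv the asker saw; keeping NAMES as a list avoids that. The sample data in `d` and the temporary output folder are assumptions made up for illustration, not the asker's real data or path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the asker's dict of data frames read from SQL.
d = {"a": pd.DataFrame({"v": [1, 2]}), "b": pd.DataFrame({"v": [3]})}

# Keep NAMES a list even for one name: looping over the bare string 'rt35'
# would yield 'r', 't', '3', '5' one character at a time.
NAMES = ["rt35"]

out_dir = tempfile.mkdtemp()  # placeholder for the real output folder
newdf = pd.concat(list(d.values()), ignore_index=True)

written = []
for name in NAMES:
    # Build the final path up front instead of writing X.csv and renaming it.
    target = os.path.join(out_dir, "{}.csv".format(name))
    newdf.to_csv(target, index=False, header=False)
    written.append(target)
```

With a longer list of names, each iteration would write the same combined frame under a different name; pairing each name with its own frame would need a mapping from name to frame.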

Related

How to iterate over a list of csv files and compile files with common filenames into a single csv as multiple columns

I am currently iterating through a list of CSV files and want to combine files that share a common filename string into a single CSV, with the data from each additional file added as a set of two new columns. I am having trouble with the final part: append adds the data as rows at the bottom of the CSV rather than as columns. I have tried pd.concat, but must be going wrong somewhere. Any help would be much appreciated.
Note: the code uses Python 2 for compatibility with the software I am using. A Python 3 solution is welcome if it translates.
Here is the code I'm currently working with:
rb_headers = ["OID_RB", "Id_RB", "ORIG_FID_RB", "POINT_X_RB", "POINT_Y_RB"]
for i in coords:
    if fnmatch.fnmatch(i, '*RB_bank_xycoords.csv'):
        df = pd.read_csv(i, header=0, names=rb_headers)
        df2 = df[::-1]
        #Export the inverted RB csv file as a new csv to the original folder overwriting the original
        df2.to_csv(bankcoords+i, index=False)
#Iterate through csvs to combine those with similar key strings in their filenames and merge them into a single csv
files_of_interest = {}
forconc = []
for filename in coords:
    if filename[-4:] == '.csv':
        key = filename[:39]
        files_of_interest.setdefault(key, [])
        files_of_interest[key].append(filename)
for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))
    files_of_interest[key]=buff_df
redundant_headers = ["OID", "Id", "ORIG_FID", "OID_RB", "Id_RB", "ORIG_FID_RB"]
outdf = buff_df.drop(redundant_headers, axis=1)
If you only want to merge everything into a single file:
paths_list=['path1', 'path2',...]
dfs = [pd.read_csv(f, header=None, sep=";") for f in paths_list]
dfs=pd.concat(dfs,ignore_index=True)
dfs.to_csv(...)
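The question asks for the second file's data to appear as new columns rather than as extra rows. A minimal sketch of that, assuming both frames have the same number of rows in matching order, is to pass axis=1 to pd.concat. The column names and values below are invented for illustration. (DataFrame.append, used in the question, was removed in pandas 2.0, so pd.concat is the way to go on current versions either way.)

```python
import pandas as pd

# Hypothetical left-bank and right-bank coordinates with matching row counts.
left = pd.DataFrame({"POINT_X": [1.0, 2.0], "POINT_Y": [5.0, 6.0]})
right = pd.DataFrame({"POINT_X_RB": [9.0, 8.0], "POINT_Y_RB": [7.0, 6.0]})

# axis=1 aligns on the index and places the frames side by side.
# Resetting the index first matters if one frame was reversed with df[::-1],
# because the reversed frame keeps its original index labels.
side_by_side = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
```

If the two files can have different lengths, the shorter one is padded with NaN in the combined frame.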

how to add header row to df.to_csv output file only after a certain condition is met

I am reading a bulk-download CSV file of stock prices and splitting it into many individual CSVs by ticker, where the ticker name is the name of the output file. The header row ("ticker, date, open, high, low, close, volume") is written ONLY the first time I run the script, because if I run it again with header set to True, it writes a new header row mixed in with the stock data. I have mode set to "a" (append) because I want each new row of data added to the file. However, a new ticker has now appeared in the source file, and because header is set to False, the newly created output file has no header, which causes processing to fail. How can I add a condition so that a header row is written ONLY for new files that did not exist before? Here is my code. Thanks.
import pandas as pd
import os
import csv
import itertools
import datetime
datetime = datetime.datetime.today().strftime('%Y-%m-%d')
filename = "Bats_"+(datetime)+".csv"
csv_file = ("H:\\EOD_DATA_RECENT\\DOWNLOADS\\"+filename)
path = 'H:\\EOD_DATA_RECENT\\VIA-API-CALL\\BATS\\'
df = pd.read_csv(csv_file)
for i, g in df.groupby('Ticker'):
    # SET HEADER TO TRUE THE FIRST RUN, THEN SET TO FALSE THEREAFTER
    g.to_csv(path + '{}.csv'.format(i), mode='a', header=False, index=False, index_label=None)
print(df.tail(5))
FINAL CODE SNIPPET BELOW THAT WORKS. Thanks
for i, g in df.groupby('Ticker'):
    if os.path.exists(path+i+".csv"):
        g.to_csv(path + '{}.csv'.format(i), mode='a', header=False, index=False, index_label=None)
    else:
        g.to_csv(path + '{}.csv'.format(i), mode='w', header=True, index=False, index_label=None)
print(df.tail(5))
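The same check can be folded into a single to_csv call by computing the header flag from whether the file already exists. This is only a sketch of that variation: the helper name `append_by_ticker`, the temporary folder, and the two small sample frames are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def append_by_ticker(df, out_dir):
    """Append each ticker's rows to its own CSV, adding a header only for new files."""
    for ticker, group in df.groupby("Ticker"):
        target = os.path.join(out_dir, "{}.csv".format(ticker))
        # Evaluated before the file is opened: True only when the file is new.
        write_header = not os.path.exists(target)
        group.to_csv(target, mode="a", header=write_header, index=False)


out_dir = tempfile.mkdtemp()  # placeholder for the real output folder
day1 = pd.DataFrame({"Ticker": ["AAA"], "Close": [1.0]})
day2 = pd.DataFrame({"Ticker": ["AAA", "BBB"], "Close": [1.5, 9.0]})

append_by_ticker(day1, out_dir)  # creates AAA.csv with a header
append_by_ticker(day2, out_dir)  # appends to AAA.csv, creates BBB.csv with a header
```

Opening a missing file in append mode creates it, so mode="a" covers both the new-file and existing-file cases.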

Getting wildcard from input files when not used in output files

I have a Snakemake rule that aggregates several result files into a single file per study. To make this more concrete: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule that aggregates the phenotype results from each study into a single summary file per study (merging the phenotypes into a single table). In the merge_results rule I give the rule a list of files (per study and role), aggregate these using a pandas data frame, and then write the result out as a single file.
While merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not part of output, and as a consequence it is not available in the wildcards object either. To get hold of the pheno I parse the filename, but this feels very hacky and I suspect there is something I have not understood properly. Is there a better way to get wildcards from input files when they are not used in the output files?
runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole = ['big','small']
rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)
rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os
        # Iterate over input files, read into pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')
            # getting the pheno from the input file and adding it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno
            tmplist.append(data)
        resmerged = pd.concat(tmplist)
        resmerged.to_csv(output, sep='\t')
You are doing it the right way!
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
role and study are wildcards. pheno is not a wildcard; its values are set by the second argument of the expand function.
To get the phenotype in your for loop, you can either parse the file name as you are doing, or reconstruct the file name directly, since you know the values that pheno takes and you can access the wildcards:
run:
    import pandas as pd
    import os
    # Iterate over phenotypes, read into pandas df
    tmplist = []
    for pheno in runpheno:
        # The global variable 'output' clashes with the rule's own 'output' here,
        # so the global is renamed to outputDir in this example
        file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno+'.summary')
        data = pd.read_csv(file, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    resmerged = pd.concat(tmplist)
    # 'output' is a list-like of output files; pass its single entry to to_csv
    resmerged.to_csv(output[0], sep='\t')
I don't know if this is better than parsing the file name like you were doing though. I wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.
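If you stay with parsing the file name, pathlib makes the intent a little more explicit than chained split calls. This is only a sketch, assuming the input layout from the question, where the phenotype is both the parent directory name and the file stem; the helper name `pheno_from_path` and the sample path are made up for illustration.

```python
from pathlib import Path


def pheno_from_path(path):
    """Return the phenotype, taken from the directory that holds the .summary file."""
    return Path(path).parent.name


# Hypothetical input path following the .../{role}/{study}/{pheno}/{pheno}.summary layout.
example = "results/big-additive/a/xxx/xxx.summary"
pheno = pheno_from_path(example)

# The file stem carries the same value in this layout.
stem = Path(example).stem
```

Inside the rule's run block, the helper would be called on each entry of input in place of the os.path.split expression.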

How to skip duplicate headers in multiple CSV files having identical columns and merge as one big data frame

I have copied 34 CSV files with identical columns into Google Colab and am trying to merge them into one big data frame. However, each CSV has a duplicate header that needs to be skipped.
The actual header will be skipped anyway while concatenating, since my CSV files have identical columns, correct?
dfs = [pd.read_csv(path.join('/content/drive/My Drive/',x), skiprows=1) for x in os.listdir('/content/drive/My Drive/') if path.isfile(path.join('/content/drive/My Drive/',x))]
df = pd.concat(dfs)
The code above throws the error below.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
The code below works for sample files, but I need an efficient way to skip the duplicate headers and merge everything into one data frame. Please suggest.
df1=pd.read_csv("./Aug_0816.csv",skiprows=1)
df2=pd.read_csv("./Sep_0916.csv",skiprows=1)
df3=pd.read_csv("./Oct_1016.csv",skiprows=1)
df4=pd.read_csv("./Nov_1116.csv",skiprows=1)
df5=pd.read_csv("./Dec_1216.csv",skiprows=1)
dfs=[df1,df2,df3,df4,df5]
df=pd.concat(dfs)
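Have you considered using glob from the standard library?
Try this:
import glob
import os
import pandas as pd
path = '/content/drive/My Drive/'
os.chdir(path)
allFiles = glob.glob("*.csv")
# header=None keeps every row as data, including each file's header row.
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; use on_bad_lines='skip' there.
dfs = [pd.read_csv(f,header=None,error_bad_lines=False) for f in allFiles]
#or if you know the specific delimiter for your csv
#dfs = [pd.read_csv(f,header=None,delimiter='yourdelimiter') for f in allFiles]
df = pd.concat(dfs)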
Try this generic script for concatenating any number of CSV files in a specific path that share a common file name format:
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f,**kwargs) for f in flist], ignore_index=True)
path = r"C:\Users\Jyotsna\Documents"
fmask = os.path.join(path, 'Detail**.csv')
df = get_merged_csv(glob.glob(fmask), index_col=None)
df.head()
If you want to skip a fixed number of rows and/or restrict the columns in each file before concatenating, change the return line accordingly:
    return pd.concat([pd.read_csv(f, skiprows=4,usecols=range(9),**kwargs) for f in flist], ignore_index=True)
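Two points from the question can be shown in one small sketch. First, with read_csv's default header handling, each file's header row becomes the column names, so headers are not repeated as data after concatenation. Second, a UnicodeDecodeError usually means a file is not UTF-8 (or is not a CSV at all), so restricting the glob to *.csv and passing an explicit encoding may help. The cp1252 encoding, the file contents, and the temporary folder below are assumptions for illustration; the real files may use a different encoding.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two small CSVs with identical columns, saved in a non-UTF-8 encoding.
tmp = tempfile.mkdtemp()
for name, city in [("Aug_0816.csv", "Zürich"), ("Sep_0916.csv", "Málaga")]:
    with open(os.path.join(tmp, name), "w", encoding="cp1252", newline="") as fh:
        fh.write("city,sales\n{},10\n".format(city))

# Only pick up CSV files; sorted() makes the row order predictable.
files = sorted(glob.glob(os.path.join(tmp, "*.csv")))

# Default header=0: the first line of each file is used as column names.
dfs = [pd.read_csv(f, encoding="cp1252") for f in files]
df = pd.concat(dfs, ignore_index=True)
```

If the files really do carry an extra junk line above the true header, adding skiprows=1 to the read_csv call would drop it while still using the following line as the header.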

Issue automating CSV import to an RSQLite DB

I'm trying to automate writing CSV files to an RSQLite DB.
I am doing so by indexing csvFiles, which is a list of data.frame variables stored in the environment.
I can't seem to figure out why my dbWriteTable() code works perfectly fine when I enter it manually but not when I try to index the name and value fields.
### CREATE DB ###
mydb <- dbConnect(RSQLite::SQLite(),"")
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in 1:length(csvFiles)) {
  dbWriteTable(mydb,name = csvFiles[i], value = csvFiles[i], overwrite=T)
  i=i+1
}
# EXAMPLE CODE THAT SUCCESSFULLY MANUAL IMPORTS INTO mydb
dbWriteTable(mydb,"DEPARTMENT",DEPARTMENT)
When I run the for loop above, I'm given this error:
"Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'DEPARTMENT': No such file or directory
# note that 'DEPARTMENT' is the value of csvFiles[1]
Here's the dput output of csvFiles:
c("DEPARTMENT", "EMPLOYEE_PHONE", "PRODUCT", "EMPLOYEE", "SALES_ORDER_LINE",
"SALES_ORDER", "CUSTOMER", "INVOICES", "STOCK_TOTAL")
I've researched this error and it seems to be related to my working directory; however, I don't really understand what to change, as I'm not even trying to manipulate files from my computer, simply data.frames already in my environment.
Please help!
Simply use get() for the value argument as you are passing a string value when a dataframe object is expected. Notice your manual version does not have DEPARTMENT quoted for value.
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in seq_along(csvFiles)) {
  dbWriteTable(mydb,name = csvFiles[i], value = get(csvFiles[i]), overwrite=T)
}
Alternatively, build a named list of data frames with mget and iterate over the list's names and elements together with Map:
dfs <- mget(csvFiles)
output <- Map(function(n, d) dbWriteTable(mydb, name = n, value = d, overwrite=T), names(dfs), dfs)