Pandas - Trying to save a set of files read with Pandas, but only the last file gets saved

I am trying to read a set of txt files into Pandas as shown below. I can read them into a DataFrame, but when I try to save the DataFrame, only the last file that was read gets saved. However, print(df) prints all the records.
Given below is the code I am using:
files = '/users/user/files'
list = []
for file in files:
    df = pd.read_csv(file)
    list.append(df)
    print(df)
df.to_csv('file_saved_path')
Could anyone advise why only the last file is being saved to the csv file and not the entire list?
Expected output:
output1
output2
output3
Current output:
output1,output2,output3

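Try this:
import os
import pandas as pd

path = '/users/user/files'
df1 = None
for i, file in enumerate(sorted(os.listdir(path))):
    data = pd.read_csv(os.path.join(path, file), sep='\t')
    if i == 0:
        # the first file starts the combined DataFrame
        df1 = data
    else:
        # keep the result in df1 so that it accumulates across files
        df1 = pd.concat([df1, data], ignore_index=True)
df1.to_csv('file_saved_path')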

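First, rename the variable list, because it shadows a Python builtin. Also note that files must be a list of file paths, not a single string (iterating over a string yields its characters); glob is one way to build that list, assuming here that the files end in .txt. Then use concat to build the final DataFrame:
import glob
import pandas as pd

files = glob.glob('/users/user/files/*.txt')
L = []
for file in files:
    df = pd.read_csv(file)
    L.append(df)
# concatenate once, after the loop, so every file is included
bigdf = pd.concat(L, ignore_index=True)
bigdf.to_csv('file_saved_path')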
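To see the difference between saving the loop variable and saving the concatenated result, here is a minimal, self-contained sketch. The temporary folder, the file names and the `value` column are made up for illustration; only the collect-then-concat pattern comes from the answers above.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: three small CSV files in a temporary folder.
folder = tempfile.mkdtemp()
for n in range(1, 4):
    sample = pd.DataFrame({"value": [f"output{n}"]})
    sample.to_csv(os.path.join(folder, f"file{n}.txt"), index=False)

# Collect one DataFrame per file, then concatenate once after the loop.
paths = sorted(glob.glob(os.path.join(folder, "*.txt")))
frames = [pd.read_csv(p) for p in paths]
bigdf = pd.concat(frames, ignore_index=True)

# Save the combined frame, not the last frame read in the loop.
out_path = os.path.join(folder, "combined.csv")
bigdf.to_csv(out_path, index=False)
print(bigdf)
```

Saving `frames[-1]` instead of `bigdf` would reproduce the behaviour described in the question, where only the last file ends up in the output.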

Related

Clean data with pandas

I have multiple files in a folder where I need to rename the headers, split after the first | and remove 'p.'.
The code looks like this
path = "/home/kristina/snpeff_analysis/a.a/result/Ann.vcf/TEST_P.G_ann.vcf/PLAY.TEST"
all_files = glob.glob(path + "/*_G.P.vcf")
#print(all_files)
aa_df = []
for filename in all_files:
    aa_df = pd.read_csv(filename, sep='\t')
    new_header = {'Gene':'Gene', 'P':'Aminoacids'}
    aa_df.rename(columns=new_header, inplace=True)
    aa_df.to_csv(filename, index=False, sep='\t')
#%%
#split & replace
def get_element(my_list, position):
    return my_list[position]
df = aa_df
for filename in all_files:
    df.Gene.str.split('|').apply(get_element, position=0), df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.','').to_csv(filename, index=False, sep='\t')
Example of the contents of one file:
Gene Aminoacids
gyrA|Rv0007|ppiA|dnaN|recF|Rv0004|gyrB|Rv0008c p.Ser95Thr|.|.|.|.|.|.|.
rpoB|rpoC|atsD|vapB8|vapC8|Rv0666 p.His445Asp|.|.|.|.|.
Rv1313c|Rv1314c|atpC|Rv1312|murA|ogt|rrs .|.|.|.|.|.|.
tlyA|ppnK|recN|Rv1697|mctB|mpg|tyrS|lprJ|Rv1691|Rv1692|Rv1693 p.Leu11Leu|.|.|.|.|.|.|.|.|.|.
The issue that I have is that when running the last part of my script it only outputs the split on the Aminoacids column.
Aminoacids
Ser95Thr
His445Asp
.
Leu11Leu
But when I change the last command to end with .head instead of .to_csv, the output in the interactive window looks correct.
(0 gyrA
1 rpoB
2 Rv1313c
3 tlyA
Name: Gene, dtype: object,
<bound method NDFrame.head of
0 Ser95Thr
1 His445Asp
2 .
3 Leu11Leu
Name: Aminoacids, dtype: object>)
What am I doing wrong?
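IIUC, you just need to assign your changes back to the columns before exporting. Pass regex=False to str.replace so that 'p.' is matched literally; as a regular expression it would match a p followed by any character, not just the prefix. Here out_path is whichever file you want to write.
df['Gene'] = df['Gene'].str.split('|').apply(get_element, position=0)
df['Aminoacids'] = df['Aminoacids'].str.split('|').apply(get_element, position=0).str.replace('p.','', regex=False)
df.to_csv(out_path, index=False, sep='\t')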
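As a quick check of the assign-then-export idea, the sketch below applies the same split and replace to a small in-memory frame. The rows are shortened versions of the sample shown in the question, and `.str[0]` is used as a stand-in for the `get_element` helper.

```python
import pandas as pd

# Shortened sample rows based on the file excerpt in the question.
df = pd.DataFrame({
    "Gene": ["gyrA|Rv0007|ppiA", "rpoB|rpoC", "Rv1313c|Rv1314c", "tlyA|ppnK"],
    "Aminoacids": ["p.Ser95Thr|.|.", "p.His445Asp|.", ".|.", "p.Leu11Leu|."],
})

# Keep only the part before the first '|' and assign it back to the column.
df["Gene"] = df["Gene"].str.split("|").str[0]

# Same for Aminoacids, then drop the literal 'p.' prefix (regex=False).
df["Aminoacids"] = (
    df["Aminoacids"].str.split("|").str[0].str.replace("p.", "", regex=False)
)
print(df)
```

Because both columns are written back into `df`, a single `df.to_csv(...)` call afterwards exports both of them.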

Copy/assign a Pandas dataframe based on their name in a for loop

I am relatively new with python - and I am struggling to do the following:
I have a set of different data frames with sequential names (df_i), which I want to access in a for loop based on their name (built as a string). How can I do that? For example:
df_1 = pd.read_csv('...')
df_2 = pd.read_csv('...')
df_3 = pd.read_csv('...')
....
n_df = 3
for i in range(n_df):
    df_namestr= 'df_' + str(i+1)
    # ---------------------
    df_temp = df_namestr
    # ---------------------
    # Operate with df_temp. For i+1= 1, df_temp should be df_1
Kind regards,
DF
You can try something like this:
for n in range(1, n_df+1):
    df_namestr = f"df_{n}"
    # look the variable up by name; returns None if it does not exist
    df_tmp = locals().get(df_namestr)
    if not isinstance(df_tmp, pd.DataFrame):
        continue
    print(df_namestr)
    print(df_tmp)
Refer to the documentation of locals() to know more.
Would it be better to approach the accessing of multiple dataframes by reading them into a list?
You could put all the csv files required in a subfolder and read them all in. Then they are in a list and you can access each one as an item in that list.
Example:
import pandas as pd
import glob
path = r'/Users/myUsername/Documents/subFolder'
csv_files = glob.glob(path + "/*.csv")
dfs = []
for filename in csv_files:
    df = pd.read_csv(filename)
    dfs.append(df)
print(len(dfs))
print(dfs[1].head())
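If the names themselves matter (df_1, df_2, ...), a dictionary keyed by name is a common alternative to both `locals()` and a plain list. The sketch below uses small made-up frames in place of the `pd.read_csv('...')` calls; the column `x` and the `row_counts` summary are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for the frames that pd.read_csv('...') would return.
dfs = {
    f"df_{n}": pd.DataFrame({"x": [n, n * 10]})
    for n in range(1, 4)
}

# Access each frame by its name string inside the loop.
for name, df_temp in dfs.items():
    print(name, df_temp["x"].sum())

# A single frame can also be looked up directly by name.
row_counts = {name: len(frame) for name, frame in dfs.items()}
print(dfs["df_2"])
```

This keeps the name-to-frame mapping explicit, so nothing depends on which variables happen to exist in the current scope.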

Pandas saving in text format

I am trying to save the output, which is a single number, to a text file after working on the dataset in pandas.
import pandas as pd
df = pd.read_csv("sales.csv")
def HighestSales():
    df.drop(['index', "month"], axis =1, inplace = True)
    df2 = df.groupby("year").sum()
    df2 = df2.sort_values(by = 'sales', ascending = True).reset_index()
    df3 = df2.loc[11, 'year']
    df4 = pd.Series(df3)
    df5 = df4.iloc[0]
    # the output here is 1964, which alone needs to be saved in the text file.
    df5.to_csv("modified.txt")
HighestSales()
But I get this error: 'numpy.int64' object has no attribute 'to_csv'. Is there a way to save just a single value in the text file?
You can do:
# open a file named modified.txt
with open('modified.txt', 'w') as f:
    # df5 is just the integer 1964; convert it to a string
    # and write it followed by a line break
    f.write(str(df5) + '\n')
You cannot call to_csv on a single value, because to_csv is a method of DataFrame and Series objects. In your case, wrap the value in a DataFrame again and then save it. If you want only the number in the .txt file, disable the index and the header:
result = pd.DataFrame([df5])
result.to_csv('modified.txt', index=False, header=False)
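A small sketch of the plain-file approach, runnable without the sales data: `value` is a made-up stand-in for df5, and the file is written to a temporary folder rather than the working directory.

```python
import os
import tempfile

import numpy as np

# Stand-in for df5 from the question: a single numpy integer.
value = np.int64(1964)

# Write the value as text, followed by a line break.
out_path = os.path.join(tempfile.mkdtemp(), "modified.txt")
with open(out_path, "w") as f:
    f.write(f"{value}\n")

# Read the file back to confirm what was written.
with open(out_path) as f:
    content = f.read()
print(repr(content))
```

Formatting the value into a string first avoids the error that comes from adding an integer and a string.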

How can I use a loop to apply a function to a list of csv files?

I'm trying to loop through all files in a directory and add "indicator" data to them. I had the code working where I could select 1 file and do this, but now am trying to make it work on all files. The problem is when I make the loop it says
ValueError: Invalid file path or buffer object type: <class 'list'>
The goal would be for each loop to read another file from list, make changes, and save file back to folder with changes.
Here is complete code w/o imports. I copied 1 of the "file_path"s from the list and put in comment at bottom.
### open dialog to select file
#file_path = filedialog.askopenfilename()
###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')
###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [ string + x for x in listdrs]
print (listdrs_path)
###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file in listdrs_path:
    file_path = listdrs_path
    data = pd.read_csv(file_path, index_col=0)
    ########################################
    ####function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        print(listdr)
        # Convert date to timestamp and make index
        data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
        data.drop("Date", axis=1, inplace=True)
        return data
    df = data
    ##print(data)
    ######Indicator data#####################
    def get_indicators(data):
        # Get MACD
        data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
        # Get MA10 and MA30
        data["ma10"] = talib.MA(data["Close"], timeperiod=10)
        data["ma30"] = talib.MA(data["Close"], timeperiod=30)
        # Get RSI
        data["rsi"] = talib.RSI(data["Close"])
        return data
    #####end functions#######
    data2 = get_indicators(data)
    print(data2)
    data2.to_csv(file_path)
###################################################
#here is an example of what path from list looks like
#'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/A.csv'
The problem is in the first two lines of the loop. The current filename is in the loop variable file, but you pass file_path to pd.read_csv, and file_path has been assigned the whole list listdrs_path. That is why you get the ValueError. Use file_path as the loop variable instead, and define the functions once, before the loop:
import os
import pandas as pd
import talib

### create list from dir
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs = os.listdir(string)
### append full path to list
listdrs_path = [string + x for x in listdrs]
print(listdrs_path)

#### function 1 (kept from the question; not called below)
def get_price_hist(file_path):
    # Put stock price data in dataframe
    data = pd.read_csv(file_path)
    # Convert date to timestamp and make index
    data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
    data.drop("Date", axis=1, inplace=True)
    return data

###### Indicator data
def get_indicators(data):
    # Get MACD
    data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
    # Get MA10 and MA30
    data["ma10"] = talib.MA(data["Close"], timeperiod=10)
    data["ma30"] = talib.MA(data["Close"], timeperiod=30)
    # Get RSI
    data["rsi"] = talib.RSI(data["Close"])
    return data

### start loop: for each file, add the indicators and overwrite the saved csv
for file_path in listdrs_path:
    data = pd.read_csv(file_path, index_col=0)
    data2 = get_indicators(data)
    print(data2)
    data2.to_csv(file_path)
Let me know if it helps.
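The read, modify, overwrite pattern can be tried without talib or the stock files. In the sketch below, the temporary folder, the two sample tickers and the `add_indicators` helper (a 2-day moving average computed with pandas) are all made up as stand-ins for the real indicator functions.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two small price files in a temporary folder.
folder = tempfile.mkdtemp()
for ticker in ["A", "B"]:
    pd.DataFrame(
        {"Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
         "Close": [1.0, 2.0, 3.0]}
    ).to_csv(os.path.join(folder, f"{ticker}.csv"), index=False)


def add_indicators(data):
    """Add a 2-day moving average of Close (stand-in for the talib calls)."""
    data["ma2"] = data["Close"].rolling(window=2).mean()
    return data


# Each iteration handles exactly one path: read, modify, write back.
for name in sorted(os.listdir(folder)):
    file_path = os.path.join(folder, name)
    data = pd.read_csv(file_path, index_col=0)
    add_indicators(data).to_csv(file_path)

# Re-read one file to confirm the new column was saved.
check = pd.read_csv(os.path.join(folder, "A.csv"), index_col=0)
print(check)
```

The key point is that `pd.read_csv` and `to_csv` both receive the single path for the current iteration, never the list of paths.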

How to read every file in folder to dataframe named after filename and overlay column names?

I am working on a project where I am downloading public data from (http://pdata.hcad.org/download/) and more particularly downloading the zip files "real_acct_ownership" and "real_building_land".
Each of these zip files contains data on homes built in the Houston area, such as addresses, fixtures, sq ft, etc.
My goal is to organize the data so that all the files in the zip folder are data frames indexable by the column "account".
I have two problems. First, I do not know how to write a function or for loop that reads each file into a data frame named after the file. Second, the files in the zip folders do not contain column names, so I need to overlay them. The column names can be found in the zip file labeled "access.zip" in the top left-hand corner of the website.
In my code so far I read each file from the above two folders individually and specify each column name by hand. I want this to be an iterative process, because I will have to do the same for other counties and would like to loop over the files in the folder.
My code so far, with no loops:
import pandas as pd
fixtures = pd.read_csv('/Users/Desktop/Real_building_land/fixtures.txt', header=None,
                       encoding='cp037', error_bad_lines=False, sep='\t')
real_acct = pd.read_csv('/Users/Desktop/Real_acct_owner/real_acct.txt', header=None,
                        encoding='cp037', error_bad_lines=False, sep='\t')
exterior = pd.read_csv('/Users/Desktop/Real_building_land/exterior.txt', header=None,
                       encoding='cp037', error_bad_lines=False, sep='\t')
fixtures.columns = ('ACCOUNT','BUILDING_NUMBER','FIXTURE_TYPE','FIXTURE_DESCRIPTION','UNITS')
real_acct.columns = ("ACCOUNT","TAX_YEAR","MAILTO","MAIL_ADDR_1","MAIL_ADDR_2","MAIL_CITY","MAIL_STATE",
"MAIL_ZIP","MAIL_COUNTRY","UNDELIVERABLE","STR_PFX" ,"STR_NUM", "STR_NUM_SFX","STR_NAME",
"STR_SFX","STR_SFX_DIR","STR_UNIT","SITE_ADDR_1","SITE_ADDR_2","SITE_ADDR_3","STATE_CLASS",
"SCHOOL_DIST","MAP_FACET","KEY_MAP","NEIGHBORHOOD_CODE","NEIGHBORHOOD_GROUP","MARKET_AREA_1",
"MARKET_AREA_1_DSCR","MARKET_AREA_2","MARKET_AREA_2_DSCR","ECON_AREA","ECON_BLD_CLASS",
"CENTER_CODE","YR_IMPR","YR_ANNEXED","SPLT_DT","DSC_CD","NXT_BUILDING","TOTAL_BUILDING_AREA",
"TOTAL_LAND_AREA","ACREAGE","CAP_ACCOUNT","SHARED_CAD_CODE","LAND_VALUE","IMPROVEMENT_VALUE",
"EXTRA_FEATURES_VALUE" ,"AG_VALUE","ASSESSED_VALUE","TOTAL_APPRAISED_VALUE","TOTAL_MARKET_VALUE",
"PRIOR_LND_VALUE","PRIOR_IMPR_VALUE","PRIOR_X_FEATURES_VALUE","PRIOR_AG_VALUE",
"PRIOR_TOTAL_APPRAISED_VALUE","PRIOR_TOTAL_MARKET_VALUE","NEW_CONSTRUCTION_VALUE",
"TOTAL_RCN_VALUE","VALUE_STATUS","NOTICED","NOTICE_DATE","PROTESTED","CERTIFIED_DATE",
"LAST_INSPECTED_DATE","LAST_INSPECTED_BY","NEW_OWNER_DATE","LEGAL_DSCR_1","LEGAL_DSCR_2",
"LEGAL_DSCR_3","LEGAL_DSCR_4","JURS")
exterior.columns = ("ACCOUNT","BUILDING_NUMBER","EXTERIOR_TYPE","EXTERIOR_DESCRIPTION","AREA")
df = fixtures.merge(real_acct,on='ACCOUNT').merge(exterior,on='ACCOUNT')
#df = df.loc[df['ACCOUNT'] == 10020000015]
print(df.shape)
My attempts with loops so far, none of which worked:
import pandas as pd
import glob
import os
dfs = {os.path.basename(f): pd.read_csv(f, sep='\t', header=None, encoding='cp037',
                                        error_bad_lines=False)
       for f in glob.glob('/Users/Desktop/Real_building_land/*.txt')}
print(dfs)
path = r'path'  # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)
Thank you in advance.
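One possible direction, sketched under assumptions: keep a mapping from file name to column names and pass it to `read_csv` through the `names` parameter while looping over the folder. The `COLUMNS` dictionary, the temporary folder and the two one-row sample files below are invented for illustration (the column names are the ones listed in the question), and the `encoding` argument from the question is left out because the sample files are plain text.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical mapping from file name (without extension) to column names.
COLUMNS = {
    "fixtures": ["ACCOUNT", "BUILDING_NUMBER", "FIXTURE_TYPE",
                 "FIXTURE_DESCRIPTION", "UNITS"],
    "exterior": ["ACCOUNT", "BUILDING_NUMBER", "EXTERIOR_TYPE",
                 "EXTERIOR_DESCRIPTION", "AREA"],
}

# Made-up, headerless, tab-separated sample files in a temporary folder.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "fixtures.txt"), "w") as f:
    f.write("1001\t1\tRMB\tRoom: Bedroom\t3\n")
with open(os.path.join(folder, "exterior.txt"), "w") as f:
    f.write("1001\t1\tBAS\tBase Area\t1500\n")

# Read every file into a dict keyed by file name, overlaying column names.
dfs = {}
for path in glob.glob(os.path.join(folder, "*.txt")):
    stem = os.path.splitext(os.path.basename(path))[0]
    dfs[stem] = pd.read_csv(path, sep="\t", header=None, names=COLUMNS[stem])

# The frames can then be merged on the shared ACCOUNT column.
merged = dfs["fixtures"].merge(dfs["exterior"], on="ACCOUNT")
print(merged.shape)
```

For another county, only the folder path and the `COLUMNS` mapping would need to change; the loop itself stays the same.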