Clean data with pandas - pandas

I have multiple files in a folder where I need to rename the headers, split after the first | and remove 'p.'.
The code looks like this
path = "/home/kristina/snpeff_analysis/a.a/result/Ann.vcf/TEST_P.G_ann.vcf/PLAY.TEST"
all_files = glob.glob(path + "/*_G.P.vcf")
#print(all_files)
aa_df = []
for filename in all_files:
aa_df = pd.read_csv(filename, sep='\t')
new_header = {'Gene':'Gene', 'P':'Aminoacids'}
aa_df.rename(columns=new_header, inplace=True)
aa_df.to_csv(filename, index=False, sep='\t')
#%%
#split & replace
def get_element(my_list, position):
return my_list[position]
df = aa_df
for filename in all_files:
df.Gene.str.split('|').apply(get_element, position=0), df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.','').to_csv(filename, index=False, sep='\t')
Ex looking into one file
Gene Aminoacids
gyrA|Rv0007|ppiA|dnaN|recF|Rv0004|gyrB|Rv0008c p.Ser95Thr|.|.|.|.|.|.|.
rpoB|rpoC|atsD|vapB8|vapC8|Rv0666 p.His445Asp|.|.|.|.|.
Rv1313c|Rv1314c|atpC|Rv1312|murA|ogt|rrs .|.|.|.|.|.|.
tlyA|ppnK|recN|Rv1697|mctB|mpg|tyrS|lprJ|Rv1691|Rv1692|Rv1693 p.Leu11Leu|.|.|.|.|.|.|.|.|.|.
The issue that I have is that when running the last part of my script it only outputs the split on the Aminoacids column.
Aminoacids
Ser95Thr
His445Asp
.
Leu11Leu
But when changing the last command to end with .head instead of .to_csv the ouput in the interactive window looks correct.
(0 gyrA
1 rpoB
2 Rv1313c
3 tlyA
Name: Gene, dtype: object,
<bound method NDFrame.head of
0 Ser95Thr
1 His445Asp
2 .
3 Leu11Leu
Name: Aminoacids, dtype: object>)
What am I doing wrong?

IIUC you just need to assign your changes to the columns before exporting it.
df['Gene'] = df['Gene'].str.split('|').apply(get_element, position=0)
df['Aminoacids'] = df['Aminoacids'].str.split('|').apply(get_element, position=0).str.replace('p.','', regex=True)
df.to_csv(out_path, index=False, sep='\t')

Related

How to save each iteration of a loop to a single .csv file using pandas

I'm using the code below to scrape the latest daily prices for a number of funds:
import requests
import pandas as pd
urls = ['https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR', 'https://markets.ft.com/data/funds/tearsheet/historical?s=IE00BHBX0Z19:EUR',
'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1076093779:EUR']
def format_date(date):
date = date.split(',')[-2][1:] + date.split(',')[-1]
return pd.Series({'Date': date})
for url in urls:
ISIN = url.split('=')[-1].replace(':', '_')
ISIN = ISIN[:-4]
ISIN = ISIN + ".OTHER"
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df['Date'] = df['Date'].apply(format_date)
del df['Open']
del df['High']
del df['Low']
del df['Volume']
df = df.rename(columns={'Close': 'last_traded_price'})
df = df.rename(columns={'Date': 'last_traded_on'})
df.insert(2, "id", ISIN)
df=df.head(1)
print (df)
df.to_csv(r'/Users/.../Testdata.csv', index=False)
At the moment, the Testdata.csv file is being overwritten everytime a new loop starts and I would like to find a way to save all of the data into the .csv file with this format:
Col 1 Col 2 Col 3
last_traded_on last_traded_price id
Oct 07 2021 78.83 LU0526609390.OTHER
Oct 07 2021 11.1 IE00BHBX0Z19.OTHER
Oct 07 2021 155.56 LU1076093779.OTHER
I need to find a way to somehow save the data to the .csv file outside of the loop but I'm really struggling to find a way to do it.
Thank you
Use a file handler:
with open(r'/Users/.../Testdata.csv', 'w') as csvfile
# Here, you need to write headers:
# csvfile.write("header1,header2,header3\n")
for url in urls:
ISIN = url.split('=')[-1].replace(':', '_')
... # The rest of your code
df.to_csv(csvfile, index=False, header=False)
Or the best practice is to collect each dataframe in a list and use pd.concat to merge all of them and save to a file:
dfs = []
for url in urls:
ISIN = url.split('=')[-1].replace(':', '_')
... # The rest of your code
dfs.append(df)
pd.concat(dfs).to_csv(r'/Users/.../Testdata.csv', index=False)
Note: your output looks like to be an output of df.to_string() rather than df.to_csv

Copy/assign a Pandas dataframe based on their name in a for loop

I am relatively new with python - and I am struggling to do the following:
I have a set of different data frames, with sequential naming (df_i), which I want to access in a for loop based on their name (with an string), how can I do that? e.g.
df_1 = pd.read_csv('...')
df_2 = pd.read_csv('...')
df_3 = pd.read_csv('...')
....
n_df = 3
for i in range(len(n_df)):
df_namestr= 'df_' + str(i+1)
# ---------------------
df_temp = df_namestr
# ---------------------
# Operate with df_temp. For i+1= 1, df_temp should be df_1
Kind regards,
DF
You can try something like that:
for n in range(1, n_df+1):
df_namestr = f"df_{n}"
df_tmp = locals().get(df_namestr)
if not isinstance(df_tmp, pd.DataFrame):
continue
print(df_namestr)
print(df_tmp)
Refer to the documentation of locals() to know more.
Would it be better to approach the accessing of multiple dataframes by reading them into a list?
You could put all the csv files required in a subfolder and read them all in. Then they are in a list and you can access each one as an item in that list.
Example:
import pandas as pd
import glob
path = r'/Users/myUsername/Documents/subFolder'
csv_files = glob.glob(path + "/*.csv")
dfs = []
for filename in csv_files:
df = pd.read_csv(filename)
dfs.append(df)
print(len(dfs))
print(dfs[1].head())

Pandas saving in text format

I am trying to save the output, which is a number ,to a text format in pandas after working on the dataset.
import pandas as pd
df = pd.read_csv("sales.csv")
def HighestSales():
df.drop(['index', "month"], axis =1, inplace = True)
df2 = df.groupby("year").sum()
df2 = df2.sort_values(by = 'sales', ascending = True).reset_index()
df3 = df2.loc[11, 'year']
df4 = pd.Series(df3)
df5 = df4.iloc[0]
#*the output here is 1964 , which alone needs to be saved in the text file*.
df5.to_csv("modified.txt")
HighestSales()
But I get 'numpy.int64' object has no attribute 'to_csv'- this error . Is there a way to save just one single value in the text file?
you can do:
# open a file named modified.txt
with open('modified.txt', 'w') as f:
# df5 is just an integer of 196
# and write 1964 plus a line break
f.write(df5 + '\n')
You cannot save a single value to csv by using "pd.to_csv". In your case you should convert it into DataFrame again and then saving it. If you want to see only the number in .txt file, you need to add some parameters:
result = pd.DataFrame(df5)
result.to_csv('modified.txt', index=False, header=False)

How can I use a loop to apply a function to a list of csv files?

I'm trying to loop through all files in a directory and add "indicator" data to them. I had the code working where I could select 1 file and do this, but now am trying to make it work on all files. The problem is when I make the loop it says
ValueError: Invalid file path or buffer object type: <class 'list'>
The goal would be for each loop to read another file from list, make changes, and save file back to folder with changes.
Here is complete code w/o imports. I copied 1 of the "file_path"s from the list and put in comment at bottom.
### open dialog to select file
#file_path = filedialog.askopenfilename()
###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')
###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [ string + x for x in listdrs]
print (listdrs_path)
###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file in listdrs_path:
file_path = listdrs_path
data = pd.read_csv(file_path, index_col=0)
########################################
####function 1
def get_price_hist(ticker):
# Put stock price data in dataframe
data = pd.read_csv(file_path)
#listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
print(listdr)
# Convert date to timestamp and make index
data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
data.drop("Date", axis=1, inplace=True)
return data
df = data
##print(data)
######Indicator data#####################
def get_indicators(data):
# Get MACD
data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
# Get MA10 and MA30
data["ma10"] = talib.MA(data["Close"], timeperiod=10)
data["ma30"] = talib.MA(data["Close"], timeperiod=30)
# Get RSI
data["rsi"] = talib.RSI(data["Close"])
return data
#####end functions#######
data2 = get_indicators(data)
print(data2)
data2.to_csv(file_path)
###################################################
#here is an example of what path from list looks like
#'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/A.csv'
The problem is in line number 13 and 14. Your filename is in variable file but you are using file_path which you've assigned the file list. Because of this you are getting ValueError. Try this:
### open dialog to select file
#file_path = filedialog.askopenfilename()
###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')
###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [ string + x for x in listdrs]
print (listdrs_path)
###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file_path in listdrs_path:
data = pd.read_csv(file_path, index_col=0)
########################################
####function 1
def get_price_hist(ticker):
# Put stock price data in dataframe
data = pd.read_csv(file_path)
#listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
print(listdr)
# Convert date to timestamp and make index
data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
data.drop("Date", axis=1, inplace=True)
return data
df = data
##print(data)
######Indicator data#####################
def get_indicators(data):
# Get MACD
data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
# Get MA10 and MA30
data["ma10"] = talib.MA(data["Close"], timeperiod=10)
data["ma30"] = talib.MA(data["Close"], timeperiod=30)
# Get RSI
data["rsi"] = talib.RSI(data["Close"])
return data
#####end functions#######
data2 = get_indicators(data)
print(data2)
data2.to_csv(file_path)
Let me know if it helps.

Pandas - Trying to save a set of files by reading it using Pandas but only the latest file gets saved

I am trying to read a set of txt files into Pandas as below. I see I am able to read them to a Dataframe however when I try to save the Dataframe it only saves the last file it read. However when I perform print(df) it prints all the records.
Given below is the code I am using:
files = '/users/user/files'
list = []
for file in files:
df = pd.read_csv(file)
list.append(df)
print(df)
df.to_csv('file_saved_path')
Could anyone advice why is the last file only being saved to the csv file and now the entire list.
Expected output:
output1
output2
output3
Current output:
output1,output2,output3
Try this:
path = '/users/user/files'
for id in range(len(os.listdir(path))):
file = os.listdir(path)[id]
data = pd.read_csv(path+'/'+file, sep='\t')
if id == 0:
df1 = data
else:
data = pd.concat([df1, data], ignore_index=True)
data.to_csv('file_saved_path')
First change variable name list, because code word in python (builtin), then for final DataFrame use concat:
files = '/users/user/files'
L = []
for file in files:
df = pd.read_csv(file)
L.append(df)
bigdf = pd.concat(L, ignore_index=True)
bigdf.to_csv('file_saved_path')