How to add new file to dataframe - pandas

I have a folder where CSV files are stored. At a certain interval a new CSV file (same format) is added to the folder.
I need to detect the new file and add its contents to the dataframe.
My current code reads all CSV files at once and stores them in a dataframe, but the dataframe should get updated with the contents of the new CSV whenever a new file is added to the folder.
import os
import glob
import pandas as pd
os.chdir(r"C:\Users\XXXX\CSVFILES")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
df = pd.concat([pd.read_csv(f) for f in all_filenames])

Let's say you have a path to the folder where the new CSVs are downloaded:
path_csv = r"C:\........\csv_folder"
I assume that your dataframe (the one you want to append to) already exists and that you load it into your script (you have probably updated it before and saved it to a CSV in another folder). Let's assume you do this:
path_saved_df = r"C:/..../saved_csv" #The path to which you've saved the previously read csv:s
filename = "my_old_files.csv"
df_old = pd.read_csv(path_saved_df + '/' + filename, sep="<your separator>")  # e.g. sep=";"
Then, to read only the latest CSV added to the folder at path_csv, you simply do the following:
list_of_csv = glob.glob(path_csv + "\\*.csv")
latest_csv = max(list_of_csv, key=os.path.getctime)  # max on creation time ensures you only read the latest file
new_file = pd.read_csv(latest_csv, sep="<your separator>", encoding="iso-8859-1")  # change encoding if you need to
Your new dataframe is then
New_df = pd.concat([df_old,new_file])
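If you also need the dataframe to keep growing as files arrive while the script is running, one simple approach is to poll the folder and append anything you have not read yet. This is only a minimal sketch, not part of the original answer; the folder path, the 60-second interval and the variable names are assumptions:
import glob
import os
import time

import pandas as pd

path_csv = r"C:\Users\XXXX\CSVFILES"    # folder being watched (placeholder path)
seen = set(glob.glob(os.path.join(path_csv, "*.csv")))
df = pd.concat([pd.read_csv(f) for f in sorted(seen)]) if seen else pd.DataFrame()

while True:
    current = set(glob.glob(os.path.join(path_csv, "*.csv")))
    new_files = current - seen                  # files that appeared since the last check
    if new_files:
        frames = [pd.read_csv(f) for f in sorted(new_files)]
        df = pd.concat([df] + frames, ignore_index=True)
        seen = current
    time.sleep(60)                              # poll once a minute; adjust as needed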

Related

Reading multiple csv files in AWS Sagemaker from a location in Amazon S3 Bucket

I have multiple csv files in a location in S3. The names of those files are in a date format, for example: 2021_09_30_Output.csv
I need to understand how I can read all the files in this folder while selecting only the dates that I require. An example would be reading only the files from September, i.e. "2021_09_*.csv", which would read only the files from that month.
Would appreciate the help. Thanks
You can create a function that returns all files from a particular date onwards, using the datetime library and the naming convention of your files. The following snippet can get you started:
import datetime

import boto3

s3 = boto3.resource('s3')
BUCKET_NAME = 'name'

def get_files_after(bucket, date):
    files = []
    for obj in s3.Bucket(bucket).objects.all():
        key = obj.key
        file_date = key[:10]  # take the 'YYYY_MM_DD' prefix of e.g. '2021_09_30_Output.csv'
        file_date = datetime.datetime.strptime(file_date, '%Y_%m_%d')
        if file_date > date:
            files.append(obj)
    return files

september_1 = datetime.datetime(2021, 9, 1)
files = get_files_after(BUCKET_NAME, september_1)
for file in files:
    contents = file.get()['Body'].read()  # an ObjectSummary needs .get() to fetch the body
    contents = contents.decode("utf-8")
    ...
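If the end goal is one dataframe for a single month, you could also filter on the key prefix directly and feed each object's body to pandas. This is only a sketch, assuming the YYYY_MM_DD_Output.csv naming from the question and that the objects sit at the top level of the bucket:
import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
BUCKET_NAME = 'name'   # placeholder, as above

# keep only the objects whose key starts with the month we want, e.g. September 2021
month_objs = [obj for obj in s3.Bucket(BUCKET_NAME).objects.all()
              if obj.key.startswith('2021_09_')]

frames = [pd.read_csv(io.BytesIO(obj.get()['Body'].read())) for obj in month_objs]
df = pd.concat(frames, ignore_index=True)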

How to iterate over a list of csv files and compile files with common filenames into a single csv as multiple columns

I am currently iterating through a list of csv files and want to combine csv files with common filename strings into a single csv file merging the data from the new csv file as a set of two new columns. I am having trouble with the final part of this in that the append command adds the data as rows at the base of the csv. I have tried with pd.concat, but must be going wrong somewhere. Any help would be much appreciated.
Note: the code is using Python 2, just for compatibility with the software I am using; a Python 3 solution is welcome if it translates.
Here is the code I'm currently working with:
rb_headers = ["OID_RB", "Id_RB", "ORIG_FID_RB", "POINT_X_RB", "POINT_Y_RB"]
for i in coords:
    if fnmatch.fnmatch(i, '*RB_bank_xycoords.csv'):
        df = pd.read_csv(i, header=0, names=rb_headers)
        df2 = df[::-1]
        # Export the inverted RB csv file as a new csv to the original folder, overwriting the original
        df2.to_csv(bankcoords+i, index=False)

# Iterate through csvs to combine those with similar key strings in their filenames and merge them into a single csv
files_of_interest = {}
forconc = []
for filename in coords:
    if filename[-4:] == '.csv':
        key = filename[:39]
        files_of_interest.setdefault(key, [])
        files_of_interest[key].append(filename)

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))
    files_of_interest[key] = buff_df

redundant_headers = ["OID", "Id", "ORIG_FID", "OID_RB", "Id_RB", "ORIG_FID_RB"]
outdf = buff_df.drop(redundant_headers, axis=1)
If you just want to merge them into one file:
paths_list=['path1', 'path2',...]
dfs = [pd.read_csv(f, header=None, sep=";") for f in paths_list]
dfs=pd.concat(dfs,ignore_index=True)
dfs.to_csv(...)
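Since the original goal was to attach the matching file's data as new columns rather than extra rows, here is a sketch of that variant, reusing the files_of_interest dict of filenames built in the question's first grouping loop (the output file name below is hypothetical):
for key, filenames in files_of_interest.items():
    # read each file in the group and place them side by side as additional columns
    frames = [pd.read_csv(f) for f in filenames]
    combined = pd.concat(frames, axis=1)                  # axis=1 concatenates column-wise
    combined.to_csv(key + '_combined.csv', index=False)   # hypothetical output name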

how to add header row to df.to_csv output file only after a certain condition is met

I am reading a bulk download CSV file of stock prices and splitting it into many individual CSVs based on ticker, where the ticker name is the name of the output file and where the header row, which contains "ticker, date, open, high, low, close, volume", is written ONLY the first time I run the script, because if I run it again with the header set to true, it writes a new header row mixed in with the stock data. I have mode set to "a", meaning "append", because I want each new row of data added to the file. However, I now see a situation where a new ticker has appeared in the source file, and because I have the header set to False, there is no header in this newly created output file, which causes processing to fail. How can I include a condition so that it writes a header row ONLY for new files which never existed before? Here is my code. Thanks
import pandas as pd
import os
import csv
import itertools
import datetime
datetime = datetime.datetime.today().strftime('%Y-%m-%d')
filename = "Bats_"+(datetime)+".csv"
csv_file = ("H:\\EOD_DATA_RECENT\\DOWNLOADS\\"+filename)
path = 'H:\\EOD_DATA_RECENT\\VIA-API-CALL\\BATS\\'
df = pd.read_csv(csv_file)
for i, g in df.groupby('Ticker'):
    # SET HEADER TO TRUE THE FIRST RUN, THEN SET TO FALSE THEREAFTER
    g.to_csv(path + '{}.csv'.format(i), mode='a', header=False, index=False, index_label=None)
print(df.tail(5))
FINAL CODE SNIPPET BELOW THAT WORKS. Thanks
for i, g in df.groupby('Ticker'):
    if os.path.exists(path + i + ".csv"):
        g.to_csv(path + '{}.csv'.format(i), mode='a', header=False, index=False, index_label=None)
    else:
        g.to_csv(path + '{}.csv'.format(i), mode='w', header=True, index=False, index_label=None)
print(df.tail(5))
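A slightly more compact variant of the same idea (my own sketch, not from the original answer) is to derive the header flag from the existence check, since mode='a' creates the file when it is missing:
for i, g in df.groupby('Ticker'):
    out = path + '{}.csv'.format(i)
    # write a header only when the file does not exist yet; always append
    g.to_csv(out, mode='a', header=not os.path.exists(out), index=False)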

Exporting Multiple log files data to single Excel using Pandas

How do I export multiple dataframes to a single Excel file? I'm not talking about merging or combining. I just want a specific range of lines from multiple log files to be compiled into a single Excel sheet. I already wrote some code but I am stuck:
import pandas as pd
import glob
import os
from openpyxl.workbook import Workbook
file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path,"*.log"))
for files in read_files:
    logs = pd.read_csv(files, header=None).loc[540:1060, :]
    print(LBS_logs)
    logs.to_excel("LBS.xlsx")
When I do this, I only get data from the first log.
Appreciate your recommendations. Thanks !
You are saving logs, which is the variable in your for loop that changes (and overwrites the output file) on each iteration. What you want is to build a list of dataframes, combine them all, and then save that to Excel.
file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))
dfs = []
for file in read_files:
    log = pd.read_csv(file, header=None).loc[540:1060, :]
    dfs.append(log)
logs = pd.concat(dfs)
logs.to_excel("LBS.xlsx")
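If the intent is really to keep each log separate rather than concatenated, a sketch using pd.ExcelWriter can put every log on its own sheet of the same workbook (deriving the sheet name from the file name is my own assumption):
with pd.ExcelWriter("LBS.xlsx") as writer:
    for file in read_files:
        log = pd.read_csv(file, header=None).loc[540:1060, :]
        sheet = os.path.splitext(os.path.basename(file))[0][:31]  # Excel caps sheet names at 31 chars
        log.to_excel(writer, sheet_name=sheet)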

How to skip duplicate headers in multiple CSV files having identical columns and merge as one big data frame

I have copied 34 CSV files with identical columns into Google Colab and am trying to merge them into one big data frame. However, each CSV has a duplicate header row which needs to be skipped.
The actual header will be skipped anyway while concatenating, since my CSV files have identical columns, correct?
dfs = [pd.read_csv(path.join('/content/drive/My Drive/', x), skiprows=1) for x in os.listdir('/content/drive/My Drive/') if path.isfile(path.join('/content/drive/My Drive/', x))]
df = pd.concat(dfs)
The above code throws the error below.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
The code below works for sample files, but I need an efficient way to skip the duplicate headers and merge everything into one data frame. Please suggest.
df1=pd.read_csv("./Aug_0816.csv",skiprows=1)
df2=pd.read_csv("./Sep_0916.csv",skiprows=1)
df3=pd.read_csv("./Oct_1016.csv",skiprows=1)
df4=pd.read_csv("./Nov_1116.csv",skiprows=1)
df5=pd.read_csv("./Dec_1216.csv",skiprows=1)
dfs=[df1,df2,df3,df4,df5]
df=pd.concat(dfs)
Have you considered using glob from the standard library?
Try this
path = ('/content/drive/My Drive/')
os.chdir(path)
allFiles = glob.glob("*.csv")
dfs = [pd.read_csv(f,header=None,error_bad_lines=False) for f in allFiles]
#or if you know the specific delimiter for your csv
#dfs = [pd.read_csv(f,header=None,delimiter='yourdelimiter') for f in allFiles]
df = pd.concat(dfs)
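As for the UnicodeDecodeError itself, it usually means the files are not UTF-8. Passing an explicit encoding is one way past it; latin-1 below is only a guess (the 0xe2 byte often points at Latin-1 or Windows-1252 content), so check it against your data:
dfs = [pd.read_csv(f, skiprows=1, encoding='latin-1') for f in allFiles]
df = pd.concat(dfs, ignore_index=True)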
Try this, a fairly generic script for concatenating any number of CSV files in a specific path that share a common file name format:
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
path = r"C:\Users\Jyotsna\Documents"
fmask = os.path.join(path, 'Detail**.csv')
df = get_merged_csv(glob.glob(fmask), index_col=None)
df.head()
If you want to skip some fixed rows and/or columns in each of the files before concatenating, edit that line of the function accordingly:
return pd.concat([pd.read_csv(f, skiprows=4,usecols=range(9),**kwargs) for f in flist], ignore_index=True)
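Since the helper already forwards **kwargs to pd.read_csv, the duplicate header row from the question can also be skipped without touching the function, for example:
df = get_merged_csv(glob.glob(fmask), index_col=None, skiprows=1)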