Export pandas dataframe to CSV - pandas

I have code that exports a dataframe in CSV format at the end. However, each time I run it, it overwrites the previous CSV, while I would like to accumulate my CSV files.
Do you know of a method to do this?
dfind.to_csv(r'C:\Users\StageProject\Indicateurs\indStat.csv', index = True, header=True)
Thanks !

The question is really about how you want to name your files. The easiest way is just to attach a timestamp to each one:
import time
unix_time = round(time.time())
This should be unique under most real-world conditions, because time doesn't go backwards and time.time() always measures seconds since the Unix epoch (UTC), independent of the local time zone. Then just save to the path:
rf'C:\Users\StageProject\Indicateurs\indStat_{unix_time}.csv'
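Putting the pieces together, the save could look something like this (a minimal sketch reusing dfind and the folder from the question):
import time

unix_time = round(time.time())  # whole seconds since the Unix epoch
dfind.to_csv(rf'C:\Users\StageProject\Indicateurs\indStat_{unix_time}.csv', index=True, header=True)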
If you want a serial count instead, like what your browser does when you save multiple versions of the same download, you will need to iterate through the files in that folder and keep incrementing your suffix until you reach a file path that does not conflict, then save there (see the sketch below).
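A minimal sketch of that serial-count approach, assuming the same folder and a base name of indStat (both just placeholders):
import os

def next_free_path(folder, base='indStat', ext='.csv'):
    # Return base.csv, or the first base_1.csv, base_2.csv, ... that doesn't exist yet
    candidate = os.path.join(folder, f'{base}{ext}')
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f'{base}_{counter}{ext}')
        counter += 1
    return candidate

# dfind.to_csv(next_free_path(r'C:\Users\StageProject\Indicateurs'), index=True, header=True)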

Related

Pandas, Glob, use wildcard to stand for end of filename

I am programming (Pandas) around a problem where certain generated files are saved with a date attached to the file. For example: file-name_20220814.csv.
However, the date ending changes each time the files are generated. What is the best way to use a wildcard to stand in for these date endings?
Glob? How would I do that in the following code:
df1 = pd.read_csv('files/file-name_20220816.csv')
Answer provided by #mitoRibo:
pd.read_csv(glob('files/file-name_*csv')[0])
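For completeness, a hedged sketch of how that pattern could be used with the import it needs; sorting the matches and picking one explicitly may be safer than assuming exactly one file matches:
from glob import glob
import pandas as pd

# Match any date suffix; sorting puts the most recently dated file last
matches = sorted(glob('files/file-name_*.csv'))
if matches:
    df1 = pd.read_csv(matches[-1])  # latest file by name; use matches[0] for the earliest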

pyspark dataframe writing csv files twice in s3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the issue is that it is written twice (one file with the actual data and another that is empty). I have checked the dataframe by printing it and it looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition doesn't necessarily mean that it will result in one file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.
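As a sketch, coalesce(1) is a common alternative to repartition(1) since it avoids a full shuffle, and the per-file record limit can be set explicitly; the variables are reused from the question's snippet:
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)  # 0 = no per-file record limit (the default)
(df.coalesce(1)  # merge partitions without a full shuffle
   .write
   .option("header", "true")
   .csv("s3://" + src_bucket_name + "/src/output/" + row.brand + "/" + fileN))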

How to export data frame back to the original csv file that I imported it from?

I have a list of DataFrames by the name of "dfs" which holds 139 DataFrames. I originally imported CSV files into Python and have deleted the first few rows from each dataframe. Now I wish to save these new files back to their original locations. How can I do that? My new data is saved in another list named final. Also, please tell me if I can make my code more efficient, as I am new to Python.
dfs = [pd.read_csv(filename) for filename in filenames]
final=[]
for i in range(139):
    a = dfs[i].iloc[604:,]
    final.append(a)
If I've understood correctly, you want to write each dataframe back to the same CSV file it was originally read from:
for df, filename in zip(final, filenames):
    # filenames already hold the full paths (including .csv), so reuse them as-is
    df.to_csv(filename, index=False)
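To also address the efficiency question, a minimal sketch that reads, trims, and writes back in a single pass (the 604-row offset and the filenames list are taken from the question's code):
import pandas as pd

for filename in filenames:
    df = pd.read_csv(filename)
    df.iloc[604:].to_csv(filename, index=False)  # drop the first 604 rows and overwrite the original file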

How to update the "last update" date in text files automatically according their modification date in OS?

There are several VHDL source files. All files have a header which gives the file name, creation date and description, among other things. One of these things is the last update date. All files are version controlled in Git.
What happens is that the files are often modified, committed and pushed, but the last update date is not always updated. This happens by mistake, since so many different files are worked on at different times, and one might forget to change the "last update" part of the file header to the latest date when the file has actually been changed.
I want to automate this process and believe there are many different ways to do this.
A script of some sort must check the last update date in the text file header. Then, if it differs from the actual last-modified date available from the file system, the last update date in the text must be updated to the last-modified value. What would be the best way to do this? A Python script, a Bash script or something else?
Basically I want to do this when the files are being committed into Git. It should ideally happen automatically, but running one line in the terminal to execute a script is not a big deal. The check is required on the files that are being committed and pushed.
I'm not a Python programmer, but I made a little script to hopefully help you out. Maybe this fits your needs.
What the script should do:
Get all files from the path (here c:\Python) which have the extension .vhdl
Loop over the files and extract the "Last update" date from the header via regex
Get the last modified date of the file from the file system
If the last-modified date is newer than the date in the file, update the file
import os
import re
import glob
import datetime

path = r"c:\Python"

# Collect all VHDL files in the target folder
mylist = glob.glob(os.path.join(path, "*.vhdl"))
print(mylist)

for filepath in mylist:
    with open(filepath, 'r+') as f:
        content = f.read()
        # Pull the date out of the "Last update:" header line
        last_update = re.findall(r"Last\supdate\:\s+(\d{4}-\d{2}-\d{2})", content)
        if not last_update:
            print(filepath, 'NO "Last update" LINE FOUND')
            continue
        # File-system modification time, truncated to YYYY-MM-DD
        modified = os.path.getmtime(filepath)
        modified_readable = str(datetime.datetime.fromtimestamp(modified))[:10]
        if modified_readable > last_update[0]:
            print(filepath, 'UPDATE')
            # Swap the old date for the modification date and rewrite the file in place
            text = re.sub(last_update[0], modified_readable, content)
            f.seek(0)
            f.write(text)
            f.truncate()
        else:
            print(filepath, 'NO CHANGE')
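Since the question asks about running this at commit time, one possible approach (an assumption, not part of the script above) is to wrap the logic in a Git pre-commit hook written in Python, saved as .git/hooks/pre-commit and marked executable; a rough sketch:
#!/usr/bin/env python3
import subprocess

# List only the files staged for this commit (added, copied or modified)
staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

vhdl_files = [f for f in staged if f.endswith(".vhdl")]
# ...run the header-update logic from the script above on vhdl_files...
# Note: any file the hook rewrites must be re-staged (git add) for the change to land in the commit.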

Appending csv files in directory into a pandas dataframe

I have written a scraper which downloads daily flight prices, stores them as pandas dataframes and saves them as CSV files in a given folder. I am now trying to combine these CSV files into a single dataframe for data analysis using append, but the end result is an empty dataframe.
Specifically, individual csv files are loaded correctly into pandas, but the append seems to fail (and several methods found on stackoverflow posts don't seem to work). Code is below, any pointers? Thanks!
directory = os.path.join("C:\\Testfolder\\")
for root, dirs, files in os.walk(directory):
    for file in files:
        daily_flight_df = pd.read_csv(directory + file, sep=";")  # loads csv into dataframe - works correctly
        cons_flight_df.append(daily_flight_df)  # appends daily flight prices into a consolidated dataframe - does not seem to work
print(cons_flight_df)  # currently prints out an empty data frame
cons_flight_df.to_csv('C:\\Testfolder\\test.csv')  # currently returns empty csv file
In pandas, the append method isn't in place. You need to assign it.
cons_flight_df = cons_flight_df.append(daily_flight_df)
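Note that DataFrame.append was deprecated and later removed in pandas 2.0, so collecting the frames in a list and concatenating them once with pd.concat is the more robust (and faster) pattern; a minimal sketch reusing the question's folder:
import os
import pandas as pd

directory = "C:\\Testfolder\\"
frames = []
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            frames.append(pd.read_csv(os.path.join(root, file), sep=";"))

cons_flight_df = pd.concat(frames, ignore_index=True)  # one concatenation instead of repeated appends
cons_flight_df.to_csv(os.path.join(directory, "test.csv"), index=False)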