I have written a scraper that downloads daily flight prices, stores them as pandas DataFrames, and saves them as CSV files in a given folder. I am now trying to combine these CSV files into a single DataFrame for analysis using append, but the end result is an empty DataFrame.
Specifically, the individual CSV files load into pandas correctly, but the append seems to fail (and several methods found in Stack Overflow posts don't seem to work). The code is below. Any pointers? Thanks!
directory = os.path.join("C:\\Testfolder\\")
for root, dirs, files in os.walk(directory):
    for file in files:
        daily_flight_df = pd.read_csv(directory + file, sep=";")  # loads csv into dataframe - works correctly
        cons_flight_df.append(daily_flight_df)  # appends daily flight prices into a pandas with consolidated flight prices - does not seem to work
print(cons_flight_df)  # currently prints out an empty data frame
cons_flight_df.to_csv('C:\\Testfolder\\test.csv')  # currently returns empty csv file
In pandas, append does not work in place: it returns a new DataFrame and leaves the original unchanged. You need to assign the result back:
cons_flight_df = cons_flight_df.append(daily_flight_df)
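DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the assignment above raises an AttributeError. A minimal sketch of the usual alternative, assuming the same semicolon-separated files: collect the daily frames in a list and call pd.concat once. The function name combine_csv_files and its parameters are illustrative, not from the original post.

```python
from pathlib import Path

import pandas as pd


def combine_csv_files(directory, sep=";"):
    """Read every .csv file in `directory` and return one combined DataFrame.

    `combine_csv_files` is a hypothetical helper name used for illustration.
    """
    # Sort the paths so the row order is reproducible across runs.
    csv_paths = sorted(Path(directory).glob("*.csv"))
    frames = [pd.read_csv(path, sep=sep) for path in csv_paths]
    if not frames:
        # pd.concat raises ValueError on an empty list, so return an empty frame.
        return pd.DataFrame()
    # ignore_index=True renumbers rows 0..n-1 instead of repeating each file's index.
    return pd.concat(frames, ignore_index=True)
```

Calling pd.concat once on a list also tends to be faster than appending inside the loop, since each append copies all the rows accumulated so far. If the combined file is written into the same folder it is read from, the next run will pick it up as input, so a separate output folder is safer.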
Related
file path format is data/year/weeknumber/no of day/data_hour.parquet
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How can I read all of these files one by one in a Databricks notebook and store them in a DataFrame?
import pandas as pd
# Get all the files under the folder
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)
# Create the list of files
list = df.path.tolist()
for i in list:
    df = spark.read.load(path=f'{i}*', format='parquet')
I am only able to read the last file; the other files are skipped.
The last line of your code does not load data incrementally. Instead, each iteration overwrites the df variable with the data from the current path, so only the last file's data remains when the loop ends.
Removing the for loop and trying the code below would give you an idea how file masking with asterisks works. Note that the path should be a full path. (I'm not sure if the data folder is your root folder or not)
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
This is what I have applied from the same answer I shared with you in the comment.
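To see how one asterisk per directory level selects files, here is a small stand-in that uses Python's glob module on a temporary copy of the folder layout from the question. Spark resolves such patterns itself, so this only illustrates the matching rule: for plain * wildcards, each asterisk matches any name within a single directory level. The helper name match_parquet_files and the two-day sample layout are assumptions made for the illustration.

```python
import glob
import os
import tempfile


def match_parquet_files(root, pattern):
    """Return paths under `root` matching `pattern`, relative to `root` and sorted."""
    matches = glob.glob(os.path.join(root, pattern))
    # Normalise to forward slashes so results look the same on every OS.
    return sorted(os.path.relpath(m, root).replace(os.sep, "/") for m in matches)


with tempfile.TemporaryDirectory() as root:
    # Recreate a small part of the layout from the question (two days, two hours each).
    for day in ("01", "02"):
        for hour in ("00", "01"):
            folder = os.path.join(root, "data", "2022", "05", day, hour)
            os.makedirs(folder)
            open(os.path.join(folder, f"data_{hour}.parquet"), "w").close()

    # One asterisk per directory level: any day, any hour, any parquet file.
    all_files = match_parquet_files(root, "data/2022/05/*/*/*.parquet")
    # Fixing the day level narrows the match to that day's files only.
    day_one = match_parquet_files(root, "data/2022/05/01/*/*.parquet")
```

Here all_files contains the four files across both days, while day_one contains only the two files under day 01.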
I have created a PySpark DataFrame and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the problem is that it is written twice (i.e., one file with the actual data and another that is empty). I have checked the DataFrame by printing it, and it looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition does not necessarily mean the output will be a single file, as this also depends on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.
I have a list of DataFrames named "dfs" that contains 139 DataFrames. I originally imported CSV files into Python and deleted the first few rows from each DataFrame. Now I wish to save these new files back to their original locations. How can I do that? My new data is saved in another list named final. Also, please tell me if I can make my code more efficient, as I am new to Python.
dfs = [pd.read_csv(filename) for filename in filenames]
final = []
for i in range(139):
    a = dfs[i].iloc[604:, ]
    final.append(a)
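I'm not sure I've understood correctly, but if you want to write each DataFrame back to the file it was read from, loop over the frames and file names together. Since filenames already holds the paths passed to read_csv, they can be reused as they are:
for df, filename in zip(final, filenames):
    df.to_csv(filename, index=False)  # index=False avoids writing the row index as an extra column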
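On the efficiency question, here is a minimal sketch, assuming each file should simply lose its first 604 data rows: the trim and the write can happen in one pass per file, without keeping all 139 frames in memory or hard-coding the count. The helper name trim_csv_in_place and the n_rows parameter are illustrative.

```python
import pandas as pd


def trim_csv_in_place(filenames, n_rows=604):
    """Drop the first `n_rows` data rows of each CSV and overwrite the file.

    `trim_csv_in_place` is a hypothetical helper; it overwrites its inputs,
    so keep a backup if the original rows may be needed again.
    """
    for filename in filenames:
        df = pd.read_csv(filename)
        # iloc[n_rows:] keeps everything from position n_rows onwards.
        trimmed = df.iloc[n_rows:]
        # index=False stops pandas from adding the row index as a new column.
        trimmed.to_csv(filename, index=False)
```

Iterating over filenames directly means the loop adapts to however many files there are.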
I have code that exports a DataFrame to CSV at the end. However, each time I run it, it replaces the previous CSV, whereas I would like to accumulate my CSV files.
Do you know a method to do this?
dfind.to_csv(r'C:\Users\StageProject\Indicateurs\indStat.csv', index = True, header=True)
Thanks !
The question is really about how you want to name your files. The easiest way is just to attach a timestamp to each one:
import time
unix_time = round(time.time())
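This should be unique in most real-world cases, provided you do not save more than once per second and the system clock is not set backwards: time.time() returns the number of seconds since the Unix epoch, which does not depend on the local time zone. Then just save to the path: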
rf'C:\Users\StageProject\Indicateurs\indStat_{unix_time}.csv'
If you want to do a serial count, like what your browser does when you save multiple versions, you will need to iterate through the files in that folder and then keep adding one to your suffix until you get to a file path that does not conflict, then save thereto.
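The serial-count approach described above can be sketched as follows. This is one possible implementation, assuming a suffix of the form _1, _2, and so on before the extension; the helper name next_available_path is made up for this example.

```python
from pathlib import Path


def next_available_path(path):
    """Return `path` if it is free, else the first `stem_N.suffix` that is.

    `next_available_path` is a hypothetical helper name, not a pandas or stdlib API.
    """
    path = Path(path)
    if not path.exists():
        return path
    counter = 1
    while True:
        # e.g. indStat.csv -> indStat_1.csv, indStat_2.csv, ...
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1
```

The result can be passed straight to to_csv, for example dfind.to_csv(next_available_path(r'C:\Users\StageProject\Indicateurs\indStat.csv'), index=True, header=True). The check and the write are separate steps, so this is suitable for a single script but not for several processes saving at the same time.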
I have a folder with several csv files, with file names between 100 and 400 (Eg. 142.csv, 278.csv etc). Not all the numbers between 100-400 are associated with a file, for example there is no 143.csv. I want to write a loop that imports 5 random files into separate dataframes in pandas instead of manually searching and typing out the file names over and over. Any ideas to get me started with this?
You can use glob and read all the csv files in the directory.
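import glob
import numpy as np
import pandas as pd
file = glob.glob('*.csv')
# replace=False ensures the same file is not picked twice
random_files = np.random.choice(file, 5, replace=False)
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))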
This picks 5 random files from the directory and reads each one into a separate DataFrame.
I hope this answers your question.
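If it helps to know which DataFrame came from which file, a variation is to key the results by file name. This is a minimal sketch using the standard library's random.sample, which never selects the same item twice; the function name load_random_csvs and its parameters are hypothetical.

```python
import random
from pathlib import Path

import pandas as pd


def load_random_csvs(directory, k=5, seed=None):
    """Read `k` distinct, randomly chosen CSV files from `directory`.

    Returns a dict mapping each file's stem (e.g. "142") to its DataFrame.
    `load_random_csvs` and its parameters are hypothetical names.
    """
    csv_paths = sorted(Path(directory).glob("*.csv"))
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    # random.sample never picks the same item twice; it raises ValueError
    # if fewer than k files are available.
    chosen = rng.sample(csv_paths, k)
    return {path.stem: pd.read_csv(path) for path in chosen}
```

With files such as 142.csv and 278.csv, the returned dict can then be indexed by those numbers as strings.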