PySpark: how to load a specific file, selected by date, from a directory of files into a dataframe

I'm trying to load one specific file from a group of files.
Example: I have hundreds of files in an HDFS directory, named in the format app_name_date.csv. I want to load one of these CSV files into a dataframe based on its date.
dataframe1 = spark.read.csv("hdfs://XXXXX/app/app_name_+$currentdate+.csv")
but it throws an error, because Python does not substitute $currentdate inside a string literal, so Spark looks for a file with that literal name and reports that it does not exist.
Error:
pyspark.sql.utils.AnalysisException: Path does not exist: hdfs://XXXXX/app/app_name_+$currentdate+.csv
Any idea how to do this in PySpark?

You can format the string with:
from datetime import date
formatted = date.today().strftime("%d-%m-%Y")
f"hdfs://XXXXX/app/app_name_{formatted}.csv"
Out[25]: 'hdfs://XXXXX/app/app_name_02-03-2022.csv'

Use the strftime format that matches how the date appears in your file names, and avoid "/" in it, since slashes in the formatted date would be read as extra directory levels in the path.
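Putting it together, a minimal sketch (hdfs://XXXXX is the placeholder base path from the question; the "%Y-%m-%d" format is an assumption and should be changed to match your actual file names):

```python
from datetime import date

# Build today's file name; "%Y-%m-%d" is an assumed format --
# use whatever format your files are actually named with
current = date.today().strftime("%Y-%m-%d")
path = f"hdfs://XXXXX/app/app_name_{current}.csv"

# dataframe1 = spark.read.csv(path)  # pass the formatted string to Spark
```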

Related

How to read data from multiple folders in ADLS into a Databricks dataframe

The file path format is data/year/weeknumber/day number/data_hour.parquet:
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How can I read all these files one by one in a Databricks notebook and store them in a dataframe?
import pandas as pd

# Get all the files under the folder
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)

# Create the list of file paths
paths = df.path.tolist()

for i in paths:
    df = spark.read.load(path=f'{i}*', format='parquet')
I am able to read only the last file; the other files are skipped.
The last line of your code does not load data incrementally. Instead, it overwrites the df variable with the data from each path on every iteration, so only the last path's data survives the loop.
Remove the for loop and try the code below; it shows how file masking with asterisks works. Note that the path should be a full path (I'm not sure whether the data folder is your root folder or not):
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
This is what I have applied from the same answer I shared with you in the comment.
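As a rough illustration of how the asterisk mask selects files, here is a sketch using fnmatch from the standard library as a stand-in (note its * is looser than Hadoop's path globbing and will also match / characters, but the idea is the same):

```python
from fnmatch import fnmatch

# Example paths taken from the question
paths = [
    "/data/2022/05/01/00/data_00.parquet",
    "/data/2022/05/01/01/data_01.parquet",
    "/data/2022/05/01/02/data_02.parquet",
]

# One wildcard per path level: day, hour, file name
pattern = "/data/2022/05/*/*/*.parquet"
matched = [p for p in paths if fnmatch(p, pattern)]
```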

pyspark dataframe writing csv files twice in s3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written, but the problem is that it is written twice: once with the actual data and once as an empty file. I have checked the dataframe by printing it and the data looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to call repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition doesn't necessarily mean that it will result in one file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.

Save the file in a different folder using python

I have a pandas dataframe and I would like to save it as a text file in another folder. Here is what I have tried so far:
import pandas as pd
df.to_csv(path = './output/filename.txt')
This does not save the file and gives me an error. How do I save the dataframe (df) into the folder called output?
The first argument of to_csv() is named path_or_buf, so either use that keyword or just pass the path positionally:
df.to_csv('./output/filename.txt')
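Note that to_csv also fails with an error if the output folder itself does not exist; a small standard-library sketch of creating it first (the folder name is the one from the question):

```python
import os

out_dir = "./output"
os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

# df.to_csv(os.path.join(out_dir, "filename.txt"))  # now the path exists
```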

Pandas read_csv error when file name starts with the letter f

So I'm using the pandas library in a Jupyter notebook, and when I try to read my CSV file whose name starts with one of the letters a, b, c, d, e, or f, I get an error.
For example when I use this command
df = pd.read_csv('~\Documents\WorkBHO\final.csv')
I get this error
[Errno 2] File b'C:\Users\bho\Documents\WorkBHO\x0cinal.csv' does not exist: b'C:\Users\bho\Documents\WorkBHO\x0cinal.csv'
but when I change the name of the file to "pinal", for example, it works just fine.
Why does this happen, and how can I read the file without changing the first letter of its name?
Backslashes need to be escaped in string literals: in a normal string, \f is the form-feed character (\x0c), which is why final.csv becomes \x0cinal.csv in your error message. You need to write either
df = pd.read_csv('~\\Documents\\WorkBHO\\final.csv')
or
df = pd.read_csv(r'~\Documents\WorkBHO\final.csv')
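A quick demonstration of what is going on: in a normal string literal \f collapses to the form-feed character, which is exactly the \x0c byte shown in the traceback, while a raw string keeps the backslash:

```python
# \f in a normal string is form feed; the other backslashes here are escaped
plain = 'C:\\Users\\bho\\Documents\\WorkBHO\final.csv'

# a raw string keeps every backslash literal
raw = r'C:\Users\bho\Documents\WorkBHO\final.csv'
```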

Cannot open a csv file

I have a CSV file that I need to work with in my Jupyter notebook. Even though I am able to view the contents of the file using the code in the picture, when I try to convert the data into a dataframe I get a "No columns to parse from file" error.
The file has no headers. My CSV file looks like this, and I have saved it in UTF-8 format.
Try using pandas to read the CSV file; since the file has no header row, pass header=None:
df = pd.read_csv("BON3_NC_CUISINES.csv", header=None)
print(df)
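A sketch of reading headerless CSV data with pandas (the inline data and column names here are made up for illustration; header=None stops pandas from consuming the first data row as a header, and names= supplies column labels):

```python
import io
import pandas as pd

# Stand-in for a headerless CSV file such as BON3_NC_CUISINES.csv
data = io.StringIO("pizza,12\nsushi,7\n")

# header=None: treat every row as data; names= labels the columns
df = pd.read_csv(data, header=None, names=["cuisine", "count"])
```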