Given a column with S3 paths, I want to read them and store the concatenated result in PySpark (amazon-s3)

I have a column with S3 file paths. I want to read all those paths and concatenate the results later in PySpark.

You can get the paths as a list using map and collect. Iterate over that list to read each path and append the resulting Spark dataframes to another list. Then use that second list (which is a list of Spark dataframes) to union all the dataframes.
# get all the paths in a list
list_of_paths = data_sdf.rdd.map(lambda r: r.links).collect()
# read each path and store the resulting dataframe in a list
list_of_sdf = []
for path in list_of_paths:
    list_of_sdf.append(spark.read.parquet(path))
# check using list_of_sdf[0].show() or list_of_sdf[1].printSchema()
# run union on all of the stored dataframes
from functools import reduce
from pyspark.sql import DataFrame
final_sdf = reduce(DataFrame.unionByName, list_of_sdf)
Use the final_sdf dataframe to write to a new parquet file.
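For example, a minimal write could look like this (the output path below is just a placeholder; adjust the mode to your needs):
# write the combined dataframe out as parquet (hypothetical output path, overwrite for illustration)
final_sdf.write.mode("overwrite").parquet("s3://your-bucket/output/combined/")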

You can supply multiple paths to the Spark parquet read function. So, assuming these are paths to parquet files that you want to read into one DataFrame, you can do something like:
list_of_paths = [r.links for r in links_df.select("links").collect()]
aggregate_df = spark.read.parquet(*list_of_paths)

Related

Merging multiple files in Pig

I have several files (around 10 files) which I would like to merge together in Pig:
Student01.txt
Student02.txt
...
Student10.txt
I am aware that I could merge two datasets together by:
data = UNION Student01, Student02;
Is there any way that I could iterate over a loop to merge the datasets from Student01 to Student10?
Assuming the files are in the same format, the LOAD command allows you to read all of them if you provide it a directory or a glob.
From the docs:
The input data to the load can be a file, a directory or a glob
Example
STUDENTS = LOAD '/path/to/students/Student*.txt' USING PigStorage();

How to read data from multiple folders in ADLS into a Databricks dataframe

The file path format is data/year/week number/day number/data_hour.parquet, for example:
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How can I read all these files one by one in a Databricks notebook and store them into a single dataframe?
import pandas as pd
# Get all the files under the folder
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)
# Create the list of file paths
list = df.path.tolist()
for i in list:
    df = spark.read.load(path=f'{i}*', format='parquet')
With this I am only able to read the last file, skipping the other files.
The last line of your code does not load data incrementally. Instead, it overwrites the df variable with the data from each path on every iteration, so only the last file remains.
Removing the for loop and trying the code below will give you an idea of how file masking with asterisks works. Note that the path should be a full path (I'm not sure whether the data folder is your root folder or not).
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
This is what I applied from the same answer I shared with you in the comment.
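If you also want to confirm which files the wildcard read actually picked up, one option (a sketch, assuming the same path layout as above) is to tag each row with its source file via input_file_name():
from pyspark.sql import functions as F
df = spark.read.load(path='/data/2022/05/*/*/*.parquet', format='parquet')
# input_file_name() records which physical file each row was read from
df.select(F.input_file_name().alias('source_file')).distinct().show(truncate=False)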

PySpark dataframe writing CSV files twice in S3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the issue is that it is written twice (i.e., one file with the actual data and another that is empty). I have checked the dataframe by printing it and the data looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition does not necessarily mean that it will result in one file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one data file in the output.
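If you want to check or pin that configuration explicitly before writing, a minimal sketch using the runtime conf API (the write path is the same as in your snippet) could be:
# 0 (the default) means no per-file record limit, so repartition(1) yields a single data file
print(spark.conf.get("spark.sql.files.maxRecordsPerFile", "0"))
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)
df.repartition(1).write.option("header", "true").csv("s3://" + src_bucket_name + "/src/output/" + row.brand + '/' + fileN)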

Parse CSV to Extract Filenames and Rename Files (Python)

I'm looking to extract file paths from a comma-separated CSV, rename the files they refer to with sequential numbers, and then update the CSV with the new paths.
I am able to extract all the first column:
import pandas as pd
my_data = pd.read_csv('test.csv', sep=',', header=0, usecols=[0])
And then the list of entries that I need:
values = list(x for x in my_data["full path"])
From there I want to use each path to rename the file sequentially (1.msg, 2.msg, 3.msg), then go back and update the CSV with the "new" path.
My CSV looks like:
full path, name, data1, data2
\path\to\a\file.msg,data,moredata,evenmoredata
Existing file path:
\path\to\a\file.msg
New file path:
\path\to\a\1.msg
Any help is appreciated.
You can directly modify the dataframe and the files by iterating through the dataframe itself. Once you have edited the desired rows, you persist the dataframe by rewriting it to a CSV file (the same file if you want to overwrite it). I assume here that file_path is the name of the column containing the file path: change it accordingly.
Explanations are in the code comments.
import os
import pandas as pd

# I'm assuming everything is correct up to the data reading
df = pd.read_csv('test.csv', sep=',', header=0, usecols=[0])

# You can iterate through the index of the dataframe itself. Were it inconsistent, use a custom counter (here `k`)
k = 1
for index, row in df.iterrows():
    # Extract the current file path, e.g. `/path/to/file.msg`
    fp = row['file_path']
    # Extract the filename, e.g. `file.msg`
    fn = os.path.basename(fp)
    # Extract the dir path, e.g. `/path/to`
    dir_path = os.path.dirname(fp)
    # Split the name from the extension
    name, ext = os.path.splitext(fn)
    # Reconstruct the new file path, e.g. `/path/to/1.msg`
    new_path = os.path.join(dir_path, str(k) + ext)
    # Wrap the rename so prohibited access or a missing file doesn't stop the loop
    try:
        os.rename(fp, new_path)
    except OSError:
        # Here you can enrich the code to handle different specific exceptions
        print(f'Error: file {fp} cannot be renamed')
    else:
        # If the file was correctly renamed, update the dataframe. Put this code outside the
        # try-except-else block to modify the dataframe in any case.
        # NOTE: the index here MUST be that of the dataframe, not the custom counter.
        df.at[index, 'file_path'] = new_path
        k = k + 1

# Overwrite the CSV. Adjust the path accordingly!
df.to_csv('test.csv', index=False)
Disclaimer: I couldn't try the code above. In case of any slip, I'll correct it as soon as possible.

How to export a dataframe back to the original CSV file that I imported it from?

I have a list named "dfs" which holds 139 DataFrames. I originally imported CSV files into Python and have deleted the first few rows from each dataframe. Now I wish to save these new files back in their original locations. How can I do that? My new data is saved in another list named final. Also, please tell me if I can make my code more efficient, as I am new to Python.
dfs = [pd.read_csv(filename) for filename in filenames]
final = []
for i in range(139):
    a = dfs[i].iloc[604:]
    final.append(a)
If I've understood correctly, you want to write each dataframe back to the same CSV file it was originally read from, i.e. the opposite way around.
for df, filename in zip(final, filenames):
    # filenames already contain the original paths (including .csv), so write straight back to them
    df.to_csv(filename, index=False)
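As a side note on efficiency: you could also skip the intermediate lists and do the whole round trip in one loop. A minimal sketch, assuming filenames holds the original CSV paths and you want to drop the first 604 rows of each file:
import pandas as pd

for filename in filenames:
    # read the file, drop the first 604 rows, and write straight back to the original path
    pd.read_csv(filename).iloc[604:].to_csv(filename, index=False)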