How to write a tab.gz file using pyspark dataframe

I have a PySpark dataframe and I want the output files to have a .tab.gz extension.
df.write \
    .option("delimiter", "\t") \
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .save(
        s3_directory,
        format='csv',
        header=True,
        emptyValue='',
        compression="gzip"
    )
This creates output files named
part-xyz.csv.gz
How can I change the configuration so they are saved as part-xyz.tab.gz, please?

As much as "tab.gz" looks like a typo, have you tried specifying the "path" argument explicitly:
file_path = s3_directory + "part-xyz.tab.gz"
df.write \
    .option("delimiter", "\t") \
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .save(
        path=file_path,
        format='csv',
        header=True,
        emptyValue='',
        compression="gzip"
    )
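Note that Spark controls the part-file names itself, so if the path argument does not give you the extension you want, another option is to rename the written objects after the job finishes. A minimal sketch with boto3, where the bucket and prefix are hypothetical placeholders standing in for your s3_directory:
import boto3

# Hypothetical bucket/prefix standing in for s3_directory
bucket = "my-bucket"
prefix = "output/run1/"

s3 = boto3.client("s3")
# list_objects_v2 returns up to 1000 keys per call, which is enough for a sketch
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    key = obj["Key"]
    if key.endswith(".csv.gz"):
        new_key = key[: -len(".csv.gz")] + ".tab.gz"
        # S3 has no rename, so copy to the new key and delete the old one
        s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)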

Related

How to iterate over a list of csv files and compile files with common filenames into a single csv as multiple columns

I am iterating through a list of csv files and want to combine the files whose filenames share a common string into a single csv, merging the data from each new file in as a set of two new columns. I am having trouble with the final part: the append command adds the data as rows at the bottom of the csv. I have tried pd.concat, but I must be going wrong somewhere. Any help would be much appreciated.
Note: the code uses Python 2, purely for compatibility with the software I am using; a Python 3 solution is welcome if it translates.
Here is the code I'm currently working with:
import fnmatch
import pandas as pd

rb_headers = ["OID_RB", "Id_RB", "ORIG_FID_RB", "POINT_X_RB", "POINT_Y_RB"]

for i in coords:
    if fnmatch.fnmatch(i, '*RB_bank_xycoords.csv'):
        df = pd.read_csv(i, header=0, names=rb_headers)
        df2 = df[::-1]
        # Export the inverted RB csv file as a new csv to the original folder, overwriting the original
        df2.to_csv(bankcoords + i, index=False)

# Iterate through csvs to combine those with similar key strings in their filenames and merge them into a single csv
files_of_interest = {}
forconc = []
for filename in coords:
    if filename[-4:] == '.csv':
        key = filename[:39]
        files_of_interest.setdefault(key, [])
        files_of_interest[key].append(filename)

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))
    files_of_interest[key] = buff_df

redundant_headers = ["OID", "Id", "ORIG_FID", "OID_RB", "Id_RB", "ORIG_FID_RB"]
outdf = buff_df.drop(redundant_headers, axis=1)
If you only want to merge everything into one file:
paths_list = ['path1', 'path2', ...]
dfs = [pd.read_csv(f, header=None, sep=";") for f in paths_list]
dfs = pd.concat(dfs, ignore_index=True)
dfs.to_csv(...)
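If the goal is to attach the second file's data as new columns rather than extra rows, concatenating along axis=1 (or merging on a shared key column) is closer to what the question describes. A minimal sketch, where left.csv and right.csv are hypothetical stand-ins for a matched pair of files from files_of_interest:
import pandas as pd

# Hypothetical file names; substitute the two files that share a filename key
left = pd.read_csv("left.csv")
right = pd.read_csv("right.csv")

# axis=1 places the frames side by side; rows are aligned by index position
combined = pd.concat([left.reset_index(drop=True), right.reset_index(drop=True)], axis=1)
combined.to_csv("combined.csv", index=False)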

How to add new file to dataframe

I have a folder where CSV files are stored. At certain intervals a new CSV file (same format) is added to the folder.
I need to detect the new file and add its contents to the dataframe.
My current code reads all the CSV files at once and stores them in a dataframe, but the dataframe should be updated with the contents of the new CSV whenever a new file is added to the folder.
import os
import glob
import pandas as pd
os.chdir(r"C:\Users\XXXX\CSVFILES")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
df = pd.concat([pd.read_csv(f) for f in all_filenames ])
Let's say you have a path to the folder where the new csv files are downloaded:
path_csv = r"C:\........\csv_folder"
I assume that your dataframe (the one you want to append to) already exists and that you load it into your script (you have probably updated it before and saved it to a csv in another folder). Let's assume you do this:
path_saved_df = r"C:/..../saved_csv"  # the path to which you've saved the previously read csv files
filename = "my_old_files.csv"
df_old = pd.read_csv(path_saved_df + '/' + filename, sep="<your separator>")  # e.g. sep=";"
Then, to read only the latest csv added to the folder in path_csv, you simply do the following:
list_of_csv = glob.glob(path_csv + "\\*.csv")
latest_csv = max(list_of_csv, key=os.path.getctime)  # the file with the newest creation time
new_file = pd.read_csv(latest_csv, sep="<your separator>", encoding="iso-8859-1")  # change encoding if you need to
Your new dataframe is then
New_df = pd.concat([df_old, new_file])
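If the script can stay running, one way to pick up each new csv as it appears is to poll the folder and track which files have already been read. A minimal sketch, assuming a periodic check is acceptable (the interval and paths are placeholders):
import glob
import os
import time
import pandas as pd

path_csv = r"C:\........\csv_folder"  # folder being watched (placeholder path)

seen = set(glob.glob(os.path.join(path_csv, "*.csv")))  # files already loaded
df = pd.concat([pd.read_csv(f) for f in sorted(seen)]) if seen else pd.DataFrame()

while True:
    current = set(glob.glob(os.path.join(path_csv, "*.csv")))
    for new_file in sorted(current - seen):  # only files not seen before
        df = pd.concat([df, pd.read_csv(new_file)], ignore_index=True)
    seen = current
    time.sleep(60)  # check again in a minute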

How to get all the records in double quotes in csv file using spark dataframe?

I am trying to save a Spark dataframe to a csv file and I want all the records wrapped in double quotes, but they are not being written that way. Could you help me with how to do this?
Example:
Source_System|Date|Market_Volume|Volume_Units|Market_Value|Value_Currency|Sales_Channel|Competitor_Name
IMS|20080628|183.0|16470.0|165653.256349|AUD|AUSTRALIA HOSPITAL|PFIZER
Desirable Output:
Source_System|Date|Market_Volume|Volume_Units|Market_Value|Value_Currency|Sales_Channel|Competitor_Name
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
Code I am running:
df4.repartition(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite') \
    .option("quoteAll", 'True') \
    .save(Output_Path_ASPAC, quote='', sep='|', header='True', nullValue=None)
You can just use df.write.csv with quoteAll set to True:
df4.repartition(1).write.csv(Output_Path_ASPAC, quote='"', header=True,
                             quoteAll=True, sep='|', mode='overwrite')
Which produces, with your example data:
"Source_System"|"Date"|"Market_Volume"|"Volume_Units"|"Market_Value"|"Value_Currency"|"Sales_Channel"|"Competitor_Name"
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"

Reading a partitioned dataset in AWS S3 with pyarrow doesn't add partition columns

I'm trying to read a partitioned dataset in AWS S3. It looks like this:
MyDirectory
  code=1
    file.parquet
  code=2
    another.parquet
  code=3
    another.parquet
I created a file_list containing the paths to all the files in the directory and then executed
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
Everything works, except that the partition column code doesn't exist in the dataframe df.
I also tried it with a single path to MyDirectory instead of file_list, but got the error
"Found files in an intermediate directory: s3://bucket/Mydirectoty", and I can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools, you can give dask a try. Assuming all the data you want to read is in s3://folder, you can just use:
import dask.dataframe as dd

storage_options = {
    'key': your_key,
    'secret': your_secret
}

df = dd.read_parquet("s3://folder",
                     storage_options=storage_options)
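If you want to stay with pyarrow itself, its dataset API can discover the code= partition from the directory layout when it is pointed at the dataset root rather than at a list of files. A minimal sketch, assuming a pyarrow version that ships pyarrow.dataset and that fs is the same S3 filesystem object used above (the bucket path is a placeholder):
import pyarrow.dataset as ds

# "hive" partitioning parses key=value folder names such as code=1
dataset = ds.dataset("bucket/MyDirectory", filesystem=fs,
                     format="parquet", partitioning="hive")
df = dataset.to_table().to_pandas()  # df now includes the "code" column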

How to add schema to a file from another file in Spark Scala

I am working in Spark and using Scala.
I have two csv files, one containing the column names and the other containing the data. How can I combine them so that I get a resultant file with both schema and data? I then need to apply operations such as groupBy and count on that file, because I have to count the distinct values in those columns.
Any help here would be really appreciated.
I wrote the code below: it builds two DataFrames from the two files after reading them, and then joins them with union. Now, how can I make the first row act as the schema, or is there another way to proceed? Any suggestions are welcome.
val sparkConf = new SparkConf().setMaster("local[4]").setAppName("hbase sql")
val sc = new SparkContext(sparkConf)
val spark1 = SparkSession.builder().config(sc.getConf).getOrCreate()
val sqlContext = spark1.sqlContext

val spark = SparkSession
  .builder
  .appName("SparkSQL")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val lines = spark1.sparkContext
  .textFile("C:/Users/ayushgup/Downloads/home_data_usage_2018122723_1372672.csv")
  .map(lines => lines.split("""\|"""))
  .toDF()

val header = spark1.sparkContext
  .textFile("C:/Users/ayushgup/Downloads/Header.csv")
  .map(lin => lin.split("""\|"""))
  .toDF()

val file = header.unionAll(lines).toDF()
spark.sparkContext.textFile() returns an RDD and will not infer a schema, even if you call .toDF() on that RDD.
sc.textFile() is for reading unstructured text files. You should use
spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("..path.to.csv")
to get the schema from the headers.
It is better to cat the files together, create a new csv, and load it into HDFS:
cat header.csv home_data_usage_2018122723_1372672.csv >> new_home_data_usage.csv
and then
hadoop fs -copyFromLocal new_home_data_usage.csv <hdfs_path>
then use
spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("..path.to.csv")