How to overwrite a text file in spark scala - dataframe

I need to overwrite a text file from a DataFrame with 4 columns in Scala Spark. I tried the command below:
result.rdd.map(x => x.mkString("\t")).saveAsTextFile("/user/Output")
It works, but I want the file to be overwritten the next time I run it.
Also, where do I need to specify the repartition parameter?

DataFrameWriter has a mode method to specify a save mode, which can be set to overwrite, and a partitionBy method to partition the data by the key columns you pass in. Example code:
df.write
.mode("overwrite")
.partitionBy("partition_key")
.save("/path/to/table")

Related

pyspark dataframe writing csv files twice in s3

I have created a PySpark DataFrame and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the issue is that it is written twice (i.e., once with the actual data and once as an empty file). I have checked the DataFrame by printing it and the data is fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition does not necessarily mean the output will be a single file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only 1 file in the output.
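If you want to double-check that setting, a minimal sketch (assuming spark is your active SparkSession):
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)    # 0 = no per-file record limit (the default)
print(spark.conf.get("spark.sql.files.maxRecordsPerFile"))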

Export pandas dataframe to CSV

I have code where at the end I export a DataFrame in CSV format. However, each time I run my code it replaces the previous CSV, whereas I would like to accumulate my CSV files.
Do you know a method to do this?
dfind.to_csv(r'C:\Users\StageProject\Indicateurs\indStat.csv', index = True, header=True)
Thanks!
The question is really about how you want to name your files. The easiest way is just to attach a timestamp to each one:
import time
unix_time = round(time.time())
This should be unique under most real-world conditions, because time doesn't go backwards and time.time() returns seconds since the Unix epoch, so it does not depend on your local time zone. Then just save to the path:
rf'C:\Users\StageProject\Indicateurs\indStat_{unix_time}.csv'
If you want to do a serial count, like what your browser does when you save multiple versions, you will need to iterate through the files in that folder and then keep adding one to your suffix until you get to a file path that does not conflict, then save thereto.
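For the serial-count variant, a minimal sketch (dfind and the folder path are taken from the question; the indStat_{n}.csv naming is an assumption):
import os

folder = r'C:\Users\StageProject\Indicateurs'
n = 0
# Keep incrementing the suffix until the path does not exist yet
while os.path.exists(os.path.join(folder, f'indStat_{n}.csv')):
    n += 1
dfind.to_csv(os.path.join(folder, f'indStat_{n}.csv'), index=True, header=True)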

How to use spark toLocalIterator to write a single file in local file system from cluster

I have a pyspark job which writes my resulting DataFrame to the local filesystem. Currently it runs in local mode, so I am doing coalesce(1) to get a single file, as below:
file_format = 'avro' # will be dynamic, e.g. avro, json, csv, etc.
df.coalesce(1).write.format(file_format).save('file:///pyspark_data/output')
But I see a lot of memory issues (OOM) and it takes a long time as well. So I want to run this job with master as yarn and mode as client, and to write the resulting DataFrame into a single file on the local filesystem I need to use toLocalIterator, which yields Rows. How can I stream these Rows into a file of the required format (json/avro/csv/parquet and so on)?
file_format = 'avro'
for row in df.toLocalIterator():
    # write the data into a single file
    pass
You get the OOM error because you try to retrieve all the data into a single partition with coalesce(1).
I don't recommend using toLocalIterator, because you would have to re-write a custom writer for every format and you wouldn't have parallel writing.
Your first solution is a good one:
df.write.format(file_format).save('file:///pyspark_data/output')
If you use Hadoop, you can merge all the output into one file on the local filesystem this way (it works for CSV; you can try it for other formats):
hadoop fs -getmerge <HDFS src> <FS destination>
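That said, if you really do want to go the toLocalIterator route, here is a minimal CSV-only sketch (the output file path is an assumption; every other format would need its own writer, which is the drawback mentioned above):
import csv

with open('/pyspark_data/output/result.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(df.columns)            # header row
    for row in df.toLocalIterator():       # streams rows to the driver one partition at a time
        writer.writerow(row)               # Row behaves like a tuple, so this works for CSV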

Output path folders to Data Lake Store without "ColumnName="

Is it possible to use the partitionBy function (or another one) without getting paths of the form "ColumnName=Value"?
I'm using a Python notebook in Azure Databricks to send a CSV file to Azure Data Lake Store. The command used is the following:
%scala
val filepath= "dbfs:/mnt/Test"
Sample
.coalesce(1)
.write
.mode("overwrite")
.partitionBy("Year","Month","Day")
.option("header", "true")
.option("delimiter",";")
.csv(filepath)
Expecting to have this path:
/Test/2018/12/11
Instead of:
/Test/Year=2018/Month=12/Day=11
This is expected behavior.
Spark builds the partition directory paths from the column names.
If you need a specific directory layout, you should use a downstream process to rename the directories, or you can filter your DataFrame and save each partition one by one into the specific directory.
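As a rough illustration of the filter-and-save approach, here is a PySpark sketch (since the notebook is Python); it assumes the DataFrame is named Sample as in the question and that the distinct Year/Month/Day combinations are few enough to collect to the driver:
from pyspark.sql.functions import col

base = "dbfs:/mnt/Test"
# One pass per distinct (Year, Month, Day) combination
for y, m, d in Sample.select("Year", "Month", "Day").distinct().collect():
    (Sample.filter((col("Year") == y) & (col("Month") == m) & (col("Day") == d))
           .drop("Year", "Month", "Day")   # keep the values out of the data as well as the path
           .coalesce(1)
           .write
           .mode("overwrite")
           .option("header", "true")
           .option("delimiter", ";")
           .csv(f"{base}/{y}/{m}/{d}"))    # plain /Test/<year>/<month>/<day> layout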

How to read tabular data on s3 in pyspark?

I have some tab separated data on s3 in a directory s3://mybucket/my/directory/.
Now, I am telling pyspark that I want to use \t as the delimiter to read in just one file like this:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext, Row
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql.functions import col, date_sub, log, mean, to_date, udf, unix_timestamp
from pyspark.sql.window import Window
from pyspark.sql import DataFrame
sc =SparkContext()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
indata_creds = sqlContext.read.load('s3://mybucket/my/directory/onefile.txt').option("delimiter", "\t")
But it is telling me: assertion failed: No predefined schema found, and no Parquet data files or summary files found under s3://mybucket/my/directory/onefile.txt
How do I tell pyspark that this is a tab-delimited file and not a parquet file?
Or, is there an easier way to do read in these files in the entire directory all at once?
thanks.
EDIT: I am using pyspark version 1.6.1
The files are on s3, so I am not able to use the usual:
indata_creds = sqlContext.read.text('s3://mybucket/my/directory/')
because when I try that, I get java.io.IOException: No input paths specified in job
Anything else I can try?
Since you're using Apache Spark 1.6.1, you need spark-csv to use this code:
indata_creds = sqlContext.read.format('com.databricks.spark.csv').option('delimiter', '\t').load('s3://mybucket/my/directory/onefile.txt')
That should work!
Another option is, for example, the approach from this other answer: instead of splitting each line by commas, split it by tabs, and then load the RDD into a DataFrame. However, the first option is easier and already loads it into a DataFrame.
Regarding the alternative in your comment, I wouldn't convert it to Parquet files. There is no need for that unless your data is really huge and compression is necessary.
For your second question in the comment, yes, it is possible to read the entire directory. Spark supports glob patterns in paths, so you could do something like this:
indata_creds = sqlContext.read.format('com.databricks.spark.csv').option('delimiter', '\t').load('s3://mybucket/my/directory/*.txt')
By the way, why are you not using 2.x.x? It's also available on AWS.
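For reference, if you do move to Spark 2.x, the built-in CSV reader handles this without spark-csv; a sketch (path taken from the question, the header setting is an assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-tsv").getOrCreate()

indata_creds = (spark.read
                     .option("sep", "\t")
                     .option("header", "false")   # assumption: the files have no header row
                     .csv("s3://mybucket/my/directory/*.txt"))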
The actual problem was that I needed to add my AWS keys to my spark-env.sh file.