How to use spark toLocalIterator to write a single file in local file system from cluster - dataframe

I have a PySpark job that writes its result dataframe to the local filesystem. It currently runs in local mode, so I call coalesce(1) to get a single file, as below:
file_format = 'avro'  # will be dynamic: avro, json, csv, etc.
df.coalesce(1).write.format(file_format).save('file:///pyspark_data/output')
But I run into memory issues (OOM) and the job takes a long time. So I want to run this job with master set to yarn in client mode. To write the result df into a single file on the local filesystem, I would then need to use toLocalIterator, which yields Rows. How can I stream these Rows into a file of the required format (json/avro/csv/parquet and so on)?
file_format = 'avro'
for row in df.toLocalIterator():
    # write the data into a single file
    pass

You get the OOM error because coalesce(1) pulls all the data into a single partition.
I don't recommend toLocalIterator, because you would have to write a custom writer for every format and you would lose parallel writes.
Your first solution is a good one once coalesce(1) is dropped:
df.write.format(file_format).save('file:///pyspark_data/output')
If you use Hadoop, you can then merge the part files into a single file on the local filesystem this way (it works for csv; you can try it for other formats):
hadoop fs -getmerge <HDFS src> <FS destination>
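If the toLocalIterator route is taken anyway, the CSV case can be sketched as below. This is a minimal sketch, not a general writer: the function name write_rows_to_csv and the output path in the usage comment are hypothetical, and every other format (avro, parquet, json) would need its own writer, which is the drawback described above. It relies on a PySpark Row being a tuple subclass, so the csv module can write it directly.

```python
import csv


def write_rows_to_csv(rows, columns, path):
    """Stream row-like tuples into one CSV file, one row at a time."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)  # header line
        for row in rows:
            writer.writerow(row)  # a pyspark Row is a tuple subclass
            count += 1
    return count


# With Spark this would be called as (df is the dataframe from the question,
# the output path is a made-up example):
# write_rows_to_csv(df.toLocalIterator(), df.columns, "/pyspark_data/output.csv")
```

Only one row is held by the writer at a time, but the driver still has to fetch each partition in turn, so the write is serial.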

Related

pyspark dataframe writing csv files twice in s3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the issue is that it is written twice (i.e., one file with the actual data and another that is empty). I have checked the dataframe by printing it and it looks fine. Please suggest a way to prevent the empty file from being created.
Code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition does not necessarily mean that you will get one file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.

Redshift Unload command with CSV extension

I'm using the following Unload command -
unload ('select * from '')to 's3://**summary.csv**'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key=''' parallel off allowoverwrite CSV HEADER;
The file created in S3 is summary.csv000
If I change and remove the file extension from the command like below
unload ('select * from '')to 's3://**summary**'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key=''' parallel off allowoverwrite CSV HEADER;
The file created in S3 is summary000
Is there a way to get summary.csv, so I don't have to change the file extension before importing it into excel?
Thanks.
Actually, a lot of people have asked a similar question. Right now it is not possible to set an extension for the files (although Parquet files can have one).
The reason is that Redshift exports in parallel by default, which is a good thing: each slice exports its own data. From the docs:
PARALLEL
By default, UNLOAD writes data in parallel to multiple files,
according to the number of slices in the cluster. The default option
is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or
more data files serially, sorted absolutely according to the ORDER BY
clause, if one is used. The maximum size for a data file is 6.2 GB.
So, for example, if you unload 13.4 GB of data, UNLOAD creates the
following three files.
So it has to start a new file after 6.2 GB, which is why a number is added as a suffix.
How do we solve this?
There is no native option in Redshift, but we can work around it with Lambda.
Create a new S3 bucket and a folder inside it specifically for this process (e.g. s3://unloadbucket/redshift-files/).
Your unload files should go to this folder.
The Lambda function should be triggered by the S3 put object event.
The Lambda function then does the following:
Download the file (if it is large, use EFS)
Rename it with a .csv extension
Upload it to the same bucket (or a different bucket) under a different path (e.g. s3://unloadbucket/csvfiles/)
Or, even simpler, use a shell/PowerShell script to do the following:
Download the file
Rename it with a .csv extension
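The Lambda steps could look roughly like the sketch below. It is an assumption-laden illustration, not a tested deployment: instead of downloading and re-uploading, it uses a server-side copy through the boto3 S3 client's managed copy method, and the helper names csv_key_for and handle_s3_event as well as the csvfiles/ prefix are hypothetical. The numeric suffix is kept (summary000.csv) so that several unload files cannot collide.

```python
from urllib.parse import unquote_plus


def csv_key_for(source_key, dest_prefix="csvfiles/"):
    """Map an UNLOAD key such as 'redshift-files/summary000' to 'csvfiles/summary000.csv'."""
    filename = source_key.rsplit("/", 1)[-1]
    return f"{dest_prefix}{filename}.csv"


def handle_s3_event(event, s3):
    """Copy every object named in an S3 put event to a key ending in .csv."""
    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        dest_key = csv_key_for(key)
        # Managed copy: boto3 switches to a multipart copy for large objects.
        s3.copy(CopySource={"Bucket": bucket, "Key": key}, Bucket=bucket, Key=dest_key)
        written.append(dest_key)
    return written


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tried without AWS

    return handle_s3_event(event, boto3.client("s3"))
```

Because the copy lands in the same bucket, the trigger would need to be restricted to the redshift-files/ prefix so that the function does not fire again on its own output.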
As per AWS Documentation around UNLOAD command, it's possible to save data as CSV.
In your case, this is what your code would look like:
unload ('select * from '')
to 's3://summary/'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key='''
CSV
parallel off
allowoverwrite
HEADER;

Output path folders to Data Lake Store without "ColumnName="

Is it possible to use partitionBy, or another function, without getting "ColumnName=Value" in the output path?
I'm using a Python notebook in Azure Databricks to send a CSV file to Azure Data Lake Store. The command used is the following:
%scala
val filepath= "dbfs:/mnt/Test"
Sample
.coalesce(1)
.write
.mode("overwrite")
.partitionBy("Year","Month","Day")
.option("header", "true")
.option("delimiter",";")
.csv(filepath)
Expecting to have this path:
/Test/2018/12/11
Instead of:
/Test/Year=2018/Month=12/Day=11
This is expected behavior.
Spark names each partition directory after the partition column and its value.
If you need a specific directory layout, use a downstream process to rename the directories, or filter your df and save each subset separately into the directory you want.
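The downstream rename could be sketched as follows. This is only an illustration under the assumption that the output directory is reachable as an ordinary local path (for a Databricks mount, something like /dbfs/mnt/Test, which is a guess and not taken from the question); the function name strip_partition_column_names is hypothetical.

```python
import os


def strip_partition_column_names(root):
    """Rename 'Column=Value' directories under root to just 'Value'."""
    renamed = []
    # Walk bottom-up so the deepest directories are renamed first and
    # every dirpath is still valid when its children are renamed.
    for dirpath, dirnames, _ in os.walk(root, topdown=False):
        for name in dirnames:
            if "=" not in name:
                continue
            value = name.split("=", 1)[1]
            source = os.path.join(dirpath, name)
            target = os.path.join(dirpath, value)
            os.rename(source, target)
            renamed.append((source, target))
    return renamed
```

After the rename, Spark can no longer infer Year, Month and Day from the path, so anything reading the data back has to take those values from the directory layout itself.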

Split CSV file in records and save as a csv file format - Apache NIFI

What I want to do is the following...
I want to split the input file into records, write each record to its own
file, and leave all the files in one directory.
My .csv file has the following structure:
ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
ERP,FRANK,DIETSCH,5064 E METAIRIE AVE.,BRANDSVILLA,MO,65687,252-5592,1176 E THAYER ST.,COLUMBIA,MO,65215,557,291-9571,217-38-5525,129-10-0407,1/13/35,M,
As you can see, it doesn't have a header row.
Here is my flow.
My problem is that when the Split processor divides my csv into flowfiles of 400 lines, they aren't saved in my output directory.
It's my first time using NiFi, sorry.
Make sure your RecordReader controller service is configured correctly (delimiter, etc.) to read the incoming flowfile.
Set the Records Per Split value to 1.
You need to use an UpdateAttribute processor before the PutFile processor to change the filename to a unique value (such as a UUID), unless you have configured the PutFile processor's Conflict Resolution Strategy as ignore.
The reason for changing the filename is that the SplitRecord processor gives the same filename to all of the split flowfiles.
Flow:
I tried your case and the flow worked as expected. Use this template for reference: upload it to your NiFi instance and make changes as per your requirements.
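To make the filename point concrete, the same logic can be sketched outside NiFi in plain Python. This is not a NiFi configuration, only an illustration of why each split needs its own name; the function name split_csv_records is hypothetical.

```python
import csv
import os
import uuid


def split_csv_records(input_path, output_dir):
    """Write each record of a header-less CSV file to its own uniquely named file."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    with open(input_path, newline="", encoding="utf-8") as source:
        for record in csv.reader(source):
            if not record:
                continue  # skip blank lines
            # A fresh UUID per record plays the role of the UpdateAttribute step.
            target = os.path.join(output_dir, f"{uuid.uuid4()}.csv")
            with open(target, "w", newline="", encoding="utf-8") as out:
                csv.writer(out).writerow(record)
            written.append(target)
    return written
```

If every record were written under the input file's own name instead, each write would land on the same path and only one file would remain.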

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' using PigStorage();
store Files into 'hdfs/output/path' using PigStorage();
Once Pig is done with the merging is there a way to remove the input files? I'd like to check that the file has been written and is not empty (i.e. 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so that ideally I'd remove only the ones in the Files variable.
I don't think this is possible with Pig alone. Instead, you can use -tagsource with the LOAD statement to get the filename of each record and store those names somewhere. Then use the HDFS FileSystem API to read the stored names and remove the files that Pig has merged.
A = LOAD '/path/' using PigStorage('delimiter','-tagsource');
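The cleanup step might be sketched like this. Note the swap: instead of the HDFS FileSystem API it shells out to the hdfs dfs -rm command, and it assumes the stored names are bare file names that must be joined to the input directory. The function names and the batch size are hypothetical.

```python
import subprocess


def build_rm_commands(filenames, input_dir, batch_size=500):
    """Build 'hdfs dfs -rm' commands for the merged input files, in batches."""
    # Deduplicate: the tagged name repeats once per record of the same file.
    unique = sorted({name.strip() for name in filenames if name.strip()})
    paths = [f"{input_dir.rstrip('/')}/{name}" for name in unique]
    return [
        ["hdfs", "dfs", "-rm", *paths[i:i + batch_size]]
        for i in range(0, len(paths), batch_size)
    ]


def remove_merged_inputs(filenames, input_dir, run=subprocess.run):
    """Run the removal commands; 'run' is injectable so the logic can be dry-run."""
    for command in build_rm_commands(filenames, input_dir):
        run(command, check=True)
```

Only files whose names were recorded during the merge are removed, so files that arrived in the input path afterwards are left alone.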
You should be able to use Hadoop commands in your Pig script:
Move input files to a new folder
Merge input files to output folder
Remove input files from the new folder
fs -mkdir hdfs/input/new_path
fs -mv hdfs/input/path/* hdfs/input/new_path/
Files = LOAD 'hdfs/input/new_path' using PigStorage();
STORE Files into 'hdfs/output/path' using PigStorage();
fs -rm -r hdfs/input/new_path