How to prevent Apache Pig from outputting empty files?

I have a Pig script that reads data from a directory on HDFS. The data is stored as Avro files. The directory structure looks like:
DIR--
--Subdir1
--Subdir2
--Subdir3
--Subdir4
In the pig script I am simply doing a load, filter and store. It looks like:
items = LOAD 'path' USING AvroStorage();
items = FILTER items BY some_property; -- placeholder condition
STORE items INTO 'outputDirectory' USING AvroStorage();
The problem is that Pig writes many empty files to the output directory. Is there a way to avoid or remove those files? Thanks!

For Pig 0.13 and later, you can set pig.output.lazy=true to avoid creating empty files (see https://issues.apache.org/jira/browse/PIG-3299).
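A minimal sketch of how that could look in the script above, reusing the same placeholder paths and filter condition from the question (only the SET line is new):
-- With lazy output enabled (Pig 0.13+, PIG-3299), a part file is only
-- created once the first record is written, so tasks whose input was
-- filtered away entirely leave no empty files behind.
SET pig.output.lazy true;

items = LOAD 'path' USING AvroStorage();
items = FILTER items BY some_property; -- placeholder condition
STORE items INTO 'outputDirectory' USING AvroStorage();
The property should also be settable from the command line, e.g. pig -Dpig.output.lazy=true myscript.pig.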

Related

Merging multiple files in Pig

I have several files (around 10 files) which I would like to merge together in Pig:
Student01.txt
Student02.txt
...
Student10.txt
I am aware that I could merge two datasets together with:
data = UNION Student01, Student02;
Is there any way to iterate over a loop to merge the datasets from Student01 through Student10?
Assuming the files are in the same format, the LOAD command allows you to read all of them at once if you provide it a directory or a glob.
From the docs:
The input data to the load can be a file, a directory or a glob
Example
STUDENTS = LOAD '/path/to/students/Student*.txt' USING PigStorage();
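Per the same docs line, pointing LOAD at the directory itself also works when every file in it should be read; a one-line sketch assuming the ten Student files are the only contents of /path/to/students:
-- Load every file in the directory in a single pass.
STUDENTS = LOAD '/path/to/students' USING PigStorage();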

pyspark dataframe writing csv files twice in s3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the issue is that it is written twice: once with the actual data and once empty. I have checked the dataframe by printing it and the data looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema=op_df.columns)
df.write.option("header", "true").csv("s3://" + src_bucket_name + "/src/output/" + row.brand + "/" + fileN)
One possible solution to make sure the output contains only one file is to call repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header", "true").csv("s3://" + src_bucket_name + "/src/output/" + row.brand + "/" + fileN)
Note that having one partition does not necessarily mean you will end up with one file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.

Create a ADF Dataset to load multiple csv files (same format) from the Blob

I'm trying to create a dataset containing multiple CSV files from the Blob. In the dataset's file path setting I create a parameter, @dataset().FolderName, and add FolderName under Parameters.
I leave the file name (in File Path) empty, as I want to grab all files in the folder. However, there is no data when I preview it. Is there anything missing? Thank you
I have tested it on my side and it works fine:
1. Add the FolderName parameter (screenshot omitted)
2. Preview the data (screenshot omitted)
If you want to merge all the CSV files in a Data Flow, you can do this:
1. Output to single file
2. Set Single partition

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' USING PigStorage();
STORE Files INTO 'hdfs/output/path' USING PigStorage();
Once Pig is done with the merging, is there a way to remove the input files? I'd like to check that each file has been written and is not empty (i.e., 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so ideally I'd remove only the files in the Files relation.
I don't think this is possible with Pig alone. Instead, what you can do is use -tagsource with the LOAD statement to capture each record's source filename and store the filenames somewhere. Then use the HDFS FileSystem API to read the stored list and remove exactly the files that Pig merged.
A = LOAD '/path/' USING PigStorage(',', '-tagsource'); -- use the actual field delimiter in place of ','
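From here, a minimal sketch of persisting the distinct source filenames for the later cleanup job (paths are placeholders; -tagsource prepends the filename as the first field of every tuple, and on Pig 0.12+ the '-tagPath' option can be used instead to get the full path, which is handier for deletion):
-- Keep one copy of each source filename so the external
-- FileSystem-API cleanup can delete exactly the merged files.
filenames = FOREACH A GENERATE $0 AS src_file;
filenames = DISTINCT filenames;
STORE filenames INTO '/path/merged_filenames' USING PigStorage();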
You should be able to use Hadoop fs commands directly in your Pig script:
1. Move the input files to a new folder
2. Merge the input files into the output folder
3. Remove the new folder
fs -mv hdfs/input/path hdfs/input/new_path
Files = LOAD 'hdfs/input/new_path' USING PigStorage();
STORE Files INTO 'hdfs/output/path' USING PigStorage();
fs -rm -r hdfs/input/new_path

Avoiding multiple headers in pig output files

We use Pig to load files from directories containing thousands of files, transform them, and then output files that are a consolidation of the input.
We've noticed that the output files contain the header record of every file processed, i.e. the header appears multiple times in each file.
Is there any way to have the header only once per output file?
raw_data = LOAD '$INPUT'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');

-- ... some transforms producing the relation 'data' ...

STORE data INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage('|');
Did you try the SKIP_INPUT_HEADER option?
See https://github.com/apache/pig/blob/31278ce56a18f821e9c98c800bef5e11e5396a69/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java#L85
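A sketch of how that might look in the script above; the four-argument CSVExcelStorage constructor is (delimiter, multiline treatment, end-of-line treatment, header treatment), and WRITE_OUTPUT_HEADER on the STORE side is an optional extra if you want exactly one header written back out:
-- Skip the header row of every input file...
raw_data = LOAD '$INPUT'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');

-- ... some transforms producing the relation 'data' ...

-- ...and write a single header per output file.
STORE data INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage('|', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');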