Merging multiple files in Pig - apache-pig

I have several files (around 10 files) which I would like to merge together in Pig:
Student01.txt
Student02.txt
...
Student10.txt
I am aware that I could merge two datasets together by:
data = UNION Student01, Student02;
Is there any way that I could iterate over a loop to merge the dataset from Student01 to Student10?

Assuming the files share the same format, the LOAD command can read all of them at once if you give it a directory or a glob.
From the docs:
The input data to the load can be a file, a directory or a glob
Example
STUDENTS = LOAD '/path/to/students/Student*.txt' USING PigStorage();

Related

Importing a *random* csv file from a folder into pandas

I have a folder with several csv files, with file names between 100 and 400 (e.g. 142.csv, 278.csv). Not all the numbers between 100 and 400 are associated with a file; for example, there is no 143.csv. I want to write a loop that imports 5 random files into separate dataframes in pandas instead of manually searching and typing out the file names over and over. Any ideas to get me started with this?
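You can use glob to list all the csv files in the directory and then pick five of them at random.
import glob
import numpy as np
import pandas as pd

files = glob.glob('*.csv')
# replace=False so that the same file is not picked twice
random_files = np.random.choice(files, 5, replace=False)
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))
This chooses 5 random files from the directory and reads each of them into a separate dataframe.
Hope this answers your question.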
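As a variation on the answer above, the selection step can also be done with the standard library alone. The following is only a sketch: the function name pick_random_csvs and its seed parameter are made up for illustration, and it assumes the folder contains at least as many csv files as requested.

```python
import random
from pathlib import Path


def pick_random_csvs(folder, k=5, seed=None):
    """Return k distinct CSV paths chosen at random from folder (hypothetical helper)."""
    # Sort first so that a fixed seed gives a reproducible selection.
    candidates = sorted(Path(folder).glob("*.csv"))
    if len(candidates) < k:
        raise ValueError(f"need {k} CSV files, found {len(candidates)}")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no file is repeated.
    return rng.sample(candidates, k)
```

Each returned path can then be passed to pd.read_csv, for example with a list comprehension over pick_random_csvs("."), to get the five separate dataframes.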

How to prevent Apache pig from outputting empty files?

I have a pig script that reads data from a directory on HDFS. The data are stored as avro files. The file structure looks like:
DIR--
--Subdir1
--Subdir2
--Subdir3
--Subdir4
In the pig script I am simply doing a load, filter and store. It looks like:
items = LOAD path USING AvroStorage();
items = FILTER items BY some property;
STORE items INTO outputDirectory USING AvroStorage();
The problem right now is that pig is outputting many empty files in the output directory. I am wondering if there's a way to remove those files? Thanks!
For Pig 0.13 and later, you can set pig.output.lazy=true to avoid creating empty output files (see https://issues.apache.org/jira/browse/PIG-3299).

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' using PigStorage();
store Files into 'hdfs/output/path' using PigStorage();
Once Pig is done with the merging is there a way to remove the input files? I'd like to check that the file has been written and is not empty (i.e. 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so that ideally I'd remove only the ones in the Files variable.
I don't think this is possible with Pig alone. What you can do instead is pass -tagsource to PigStorage in the LOAD statement to get the source file name for each record, and store those names somewhere. Then use the HDFS FileSystem API to read the stored names and remove exactly the files that Pig merged.
A = LOAD '/path/' using PigStorage('delimiter','-tagsource');
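The clean-up step described above could be sketched roughly as follows. Note that this swaps the HDFS FileSystem API for the hdfs dfs -rm command line, and that the helper name build_rm_commands is hypothetical. It also assumes the stored names are bare file names (one per record) that live directly under the input directory. The function only builds the commands; it does not run them.

```python
import posixpath


def build_rm_commands(tagged_names, input_dir):
    """Build one `hdfs dfs -rm` argument list per distinct source file.

    tagged_names: file names collected from the -tagsource column
    (assumed to be bare names, one per record, possibly repeated).
    """
    seen = set()
    commands = []
    for name in tagged_names:
        name = name.strip()
        if not name or name in seen:
            continue  # skip blank lines and files already listed
        seen.add(name)
        commands.append(["hdfs", "dfs", "-rm", posixpath.join(input_dir, name)])
    return commands
```

Once the merged output has been checked, each argument list could be handed to subprocess.run(cmd, check=True). Because only files that appeared in the LOAD are listed, anything added to the input path in the meantime is left alone.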
You should be able to run Hadoop file system commands from your Pig script with fs:
Move the input files to a new folder
Merge the input files into the output folder
Remove the input files from the new folder
-- fs -mv moves the files; distcp would only copy them
fs -mv hdfs/input/path hdfs/input/new_path
Files = LOAD 'hdfs/input/new_path' using PigStorage();
STORE Files into 'hdfs/output/path' using PigStorage();
fs -rm -r hdfs/input/new_path

Avoiding multiple headers in pig output files

We use Pig to load files from directories containing thousands of files, transform them, and then output files that are a consolidation of the input.
We've noticed that the output files contain the header record of every file processed, i.e. the header appears multiple times in each file.
Is there any way to have the header only once per output file?
raw_data = LOAD '$INPUT'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
-- DO SOME TRANSFORMS
STORE data INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage('|');
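Did you try the SKIP_INPUT_HEADER option? It is passed as the fourth (header treatment) argument of CSVExcelStorage in the LOAD statement, after the delimiter, multiline and end-of-line options.
See https://github.com/apache/pig/blob/31278ce56a18f821e9c98c800bef5e11e5396a69/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java#L85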

Using R functions lapply and read.csv.sql

I am trying to open multiple csv files using a list such as the one below:
filenames <- list.files("temp", pattern="*.csv", full.names=TRUE)
I have found examples that use lapply and read.csv to open all the files in the temp directory, but I know a priori what data I need to extract from each file, so to save time reading I want to use the SQL extension of this:
somefile = read.csv.sql("temp/somefile.csv", sql="select * from file ",eol="\n")
However, I am having trouble combining these two pieces of functionality into a single command, so that I can read all the files in a directory while applying the same SQL query.
Has anybody had success doing this?
If you want a list of dataframes, one per file (assuming your working directory contains the .csv files):
library(sqldf)  # provides read.csv.sql
filenames <- list.files(".", pattern="\\.csv$")
df.list <- sapply(filenames, read.csv.sql, sql="select * from file", eol="\n", simplify=FALSE)
Or if you want them all combined into one dataframe:
library(plyr)  # provides ldply
df <- ldply(filenames, read.csv.sql, sql="select * from file", eol="\n")