ORC format - PIG - dependent libraries - apache-pig

I understand that to write ORC format with Snappy compression from a Pig script, I should use OrcStorage('-c SNAPPY').
Which SET command or library do I need to include so that the result dataset can be stored in ORC format?
Please help.
Subra

Check which Pig version you are using.
ORC storage is available as a built-in function from Pig 0.14 onward.
See the examples:
https://pig.apache.org/docs/r0.14.0/func.html#OrcStorage
UPDATE
This Pig script works fine:
data = LOAD 'SO/date.txt' USING PigStorage(' ') AS (ts:chararray);
STORE data INTO 'orc/snappy' using OrcStorage('-c SNAPPY');
data_orc = LOAD 'orc/snappy' using OrcStorage('-c SNAPPY');
DUMP data_orc;
You don't even need to REGISTER the kryo jar in the script: it is not used directly by Pig, so it would be optimized out. It is used via reflection, though, so you have to add the kryo jar to the classpath, like this:
pig -latest -useHCatalog -cp ./kryo-2.24.0.jar orc.pig

Related

How to use spark toLocalIterator to write a single file in local file system from cluster

I have a pyspark job that writes my resulting dataframe to the local filesystem. Currently it runs in local mode, so I use coalesce(1) to get a single file, as below:
file_format = 'avro' # will be dynamic and so it will be like avro, json, csv, etc
df.coalesce(1).write.format(file_format).save('file:///pyspark_data/output')
But I see a lot of memory issues (OOM), and the job takes a long time as well. So I want to run it with master yarn and deploy mode client. To write the result df into a single file on the local filesystem, I then need to use toLocalIterator, which yields Rows. How can I stream these Rows into a file of the required format (json/avro/csv/parquet and so on)?
file_format = 'avro'
for row in df.toLocalIterator():
    # write the data into a single file
    pass
You get the OOM error because coalesce(1) pulls all the data into a single partition.
I don't recommend using toLocalIterator, because you would have to write a custom writer for every format and you would lose parallel writes.
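To make that cost concrete, here is a minimal Python sketch of what such a hand-written writer might look like for just two formats. The function name write_rows is hypothetical and not part of PySpark. It assumes each row has already been converted to a dict (for example with Row.asDict()). Avro or Parquet would each need another branch and a third-party library.

```python
import csv
import json


def write_rows(rows, path, file_format):
    """Stream an iterable of dict rows into one file, one row at a time.

    Hypothetical helper: only 'json' (one object per line) and 'csv' are handled.
    Returns the number of rows written.
    """
    if file_format not in ("json", "csv"):
        raise ValueError(f"no writer implemented for {file_format!r}")

    count = 0
    with open(path, "w", newline="") as out:
        if file_format == "json":
            for row in rows:
                out.write(json.dumps(row) + "\n")
                count += 1
        else:
            writer = None
            for row in rows:
                if writer is None:
                    # Take the column names from the first row and emit a header.
                    writer = csv.DictWriter(out, fieldnames=list(row))
                    writer.writeheader()
                writer.writerow(row)
                count += 1
    return count


# Assumed usage with a Spark DataFrame named df:
# write_rows((r.asDict() for r in df.toLocalIterator()), "/pyspark_data/output.csv", "csv")
```

Everything is written by the driver in a single thread, which is the loss of parallelism mentioned above.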
Your first solution is a good one once the coalesce is dropped:
df.write.format(file_format).save('file:///pyspark_data/output')
If you use Hadoop, you can then merge all the part files into a single file on the local filesystem this way (it works for CSV; you can try it for other formats):
hadoop fs -getmerge <HDFS src> <FS destination>
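If the part files already sit in a local directory, the same kind of concatenation can be sketched in plain Python. This is only an illustration under assumptions: merge_part_files is a hypothetical name, the part-* pattern matches Spark's usual output file names, and a byte-level merge is only meaningful for line-oriented formats such as CSV without per-file headers or JSON lines, not for Avro or Parquet.

```python
import shutil
from pathlib import Path


def merge_part_files(src_dir, dest_file, pattern="part-*"):
    """Concatenate the part files in src_dir, in name order, into dest_file.

    Hypothetical helper; dest_file should live outside src_dir.
    Returns the number of part files merged.
    """
    parts = sorted(p for p in Path(src_dir).glob(pattern) if p.is_file())
    with open(dest_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so that no part file is loaded fully into memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

Called as merge_part_files('/pyspark_data/output', '/pyspark_data/merged.csv'), this would leave the original part files in place, so they can be checked before being deleted.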

How to prevent Apache pig from outputting empty files?

I have a pig script that reads data from a directory on HDFS. The data are stored as avro files. The file structure looks like:
DIR--
--Subdir1
--Subdir2
--Subdir3
--Subdir4
In the pig script I am simply doing a load, filter and store. It looks like:
items = LOAD path USING AvroStorage()
items = FILTER items BY some property
STORE items into outputDirectory using AvroStorage()
The problem right now is that pig is outputting many empty files in the output directory. I am wondering if there's a way to remove those files? Thanks!
For Pig 0.13 and later, you can set pig.output.lazy=true to avoid creating empty files (see https://issues.apache.org/jira/browse/PIG-3299).

Use a Hive Registered UDF in Spark

I register a udf in Hive through beeline using the following:
CREATE FUNCTION udfTest AS 'my.udf.SimpleUDF' USING JAR 'hdfs://hostname/pathToMyJar.jar';
Then I can use it in beeline as follows:
SELECT udfTest(name) from myTable;
Which returns the expected result.
I then launch a spark-shell and run the following:
sqlContext.sql("SELECT udfTest(name) from myTable")
This fails. The stack trace is several hundred lines long (so I can't paste it here), but the key parts are:
org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to java.net.URLClassLoader
Unable to load resources for default.udftest:java.lang.IllegalArgumentException: Unable to register [/tmp/blarg/pathToMyJar.jar]
I can provide more detail if anything stands out.
Is it possible to use UDFs Registered through Hive in Spark?
Spark Version 1.3.0
When using a custom UDF, make sure that the jar file for your UDF is included with your application, or use the --jars command-line option to pass the UDF jar when launching spark-shell, as shown below:
./bin/spark-shell --jars <path-to-your-hive-udf>.jar
For more details, refer to "Calling Hive User-Defined Functions from Spark".
We had the same issue recently. What we noticed was that if the jar path is available locally, everything works, but if the jar path is on HDFS, it doesn't. So we ended up copying the jar to the local filesystem with FileSystem.copyToLocalFile and then adding the copied file. That worked for us in both cluster and client mode.
PS: this was with Spark 2.0.

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' using PigStorage();
store Files into 'hdfs/output/path' using PigStorage();
Once Pig is done with the merging is there a way to remove the input files? I'd like to check that the file has been written and is not empty (i.e. 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so that ideally I'd remove only the ones in the Files variable.
I don't think this is possible with Pig alone. Instead, you can use -tagsource in the LOAD statement to get the source file name of each record and store those names somewhere. Then use the HDFS FileSystem API to read the stored names and remove the files that Pig has merged.
A = LOAD '/path/' using PigStorage('delimiter','-tagsource');
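The bookkeeping half of this approach could be sketched in Python as follows. This is a sketch under assumptions rather than a tested HDFS client: files_to_remove is a hypothetical name, the recorded values are assumed to be bare file names relative to the input directory, and the output size is assumed to have been looked up separately. The function only builds the list of paths and deletes nothing, so the result can be reviewed before it is passed to something like hdfs dfs -rm.

```python
import posixpath


def files_to_remove(input_dir, recorded_names, output_size_bytes):
    """Return the input paths that are safe to delete after a merge.

    Hypothetical helper: nothing is returned unless the merged output is
    non-empty, and only names that were recorded during the LOAD are included,
    so files that arrived later are left alone.
    """
    if output_size_bytes <= 0:
        # The merged output is missing or empty: keep every input file.
        return []
    # Each name appears once per record, so deduplicate and drop blank entries.
    unique_names = sorted({name.strip() for name in recorded_names if name.strip()})
    return [posixpath.join(input_dir, name) for name in unique_names]
```

Keeping the decision separate from the actual deletion makes it easy to log or dry-run the list first.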
You should be able to run Hadoop file system commands from your Pig script with fs:
Move input files to a new folder
Merge input files to output folder
Remove input files from the new folder
fs -mv hdfs/input/path hdfs/input/new_path
Files = LOAD 'hdfs/input/new_path' using PigStorage();
STORE Files into 'hdfs/output/path' using PigStorage();
fs -rm -r hdfs/input/new_path

BZip2 compressed input for Apache Flink

I have a wikipedia dump compressed with bzip2 (downloaded from http://dumps.wikimedia.org/enwiki/), but I don't want to unpack it: I want to process it while decompressing on the fly.
I know that it's possible to do it in plain Java (see e.g. Java - Read BZ2 file and uncompress/parse on the fly), but I was wondering how to do it in Apache Flink. What I probably need is something like https://github.com/whym/wikihadoop but for Flink, not Hadoop.
Apache Flink can read files compressed with the following codecs:
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.Lz4Codec
org.apache.hadoop.io.compress.SnappyCodec
As you can see from the package names, Flink does this using Hadoop's InputFormats.
This is an example for reading gz files using Flink's Scala API:
(You need at least Flink 0.8.1)
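import org.apache.flink.api.scala._
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

def main(args: Array[String]) {
  val env = ExecutionEnvironment.getExecutionEnvironment
  val job = new JobConf()
  // Hadoop's TextInputFormat picks the decompression codec from the file extension
  val hadoopInput = new TextInputFormat()
  FileInputFormat.addInputPath(job, new Path("/home/robert/Downloads/cawiki-20140407-all-titles.gz"))
  val lines = env.createHadoopInput(hadoopInput, classOf[LongWritable], classOf[Text], job)
  lines.print
  env.execute("Read gz files")
}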
Apache Flink itself only has built-in support for .deflate files. Adding support for more compression codecs would be easy, but it hasn't been done yet.
Using HadoopInputFormats with Flink doesn't cause any performance loss. Flink has built-in serialization support for Hadoop's Writable types.