Writing Spark RDD as Gzipped file in Amazon s3 - amazon-s3

I have an output RDD in my Spark code written in Python. I want to save it in Amazon S3 as a gzipped file. I have tried the following functions.
The function below correctly saves the output RDD in S3, but not in gzipped format.
output_rdd.saveAsTextFile("s3://<name-of-bucket>/")
The function below returns an error: TypeError: saveAsHadoopFile() takes at least 3 arguments (3 given)
output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/",
compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
Please guide me with the correct way to do this.

You need to specify the output format as well.
Try this:
output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/", "org.apache.hadoop.mapred.TextOutputFormat", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
You can use any of the Hadoop-supported compression codecs:
gzip: org.apache.hadoop.io.compress.GzipCodec
bzip2: org.apache.hadoop.io.compress.BZip2Codec
LZO: com.hadoop.compression.lzo.LzopCodec
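For reference, a minimal sketch of a shorter alternative, assuming the RDD holds plain text lines (the bucket name is a placeholder): PySpark's saveAsTextFile also accepts a compressionCodecClass argument, so the codec can be passed without naming the output format explicitly.
output_rdd.saveAsTextFile(
    "s3://<name-of-bucket>/",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)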

Related

Control the compression level when writing Parquet files using Polars in Rust

I found that by default polars' output Parquet files are around 35% larger than Parquet files output by Spark (on the same data). Spark uses snappy for compression by default, and it doesn't help if I switch ParquetCompression to Snappy in polars. I wonder whether this is because polars uses a more conservative compression level? Is there any way to control the compression level of Parquet files in polars? I checked the polars docs; it seems that only Zstd accepts a ZstdLevel (and I'm not even sure whether that is a compression level).
Below is my code to write a DataFrame to a Parquet file using the snappy compression.
use polars::prelude::*;
use std::{fs::File, io::BufWriter};
// `df` is an existing mutable DataFrame
let f = File::create("j.parquet").expect("Unable to create the file j.parquet!");
let bfw = BufWriter::new(f);
let pw = ParquetWriter::new(bfw).with_compression(ParquetCompression::Snappy);
pw.finish(&mut df).expect("Unable to write j.parquet!");
This is not (yet) possible in Rust polars. It will likely land in the next release of arrow2, and then we can implement it in polars as well.
If you want that functionality in Python polars, you can leverage pyarrow for this purpose. polars has zero-copy interop with pyarrow.
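For example, a minimal sketch of that pyarrow route from Python polars (the file name, codec, and compression level are illustrative; pyarrow's write_table exposes a compression_level parameter):
import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Zero-copy conversion to a pyarrow Table, then write with an explicit
# codec and compression level.
table = df.to_arrow()
pq.write_table(table, "j.parquet", compression="zstd", compression_level=10)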

Using blockchain_parser to parse blk files stored on S3

I'm trying to use alecalve's bitcoin-parser package for Python.
The problem is that my node's data is stored in an S3 bucket.
As the parser uses os.path.expanduser for the .dat files directory, expecting a filesystem, I can't just use my S3 path. The example from the documentation is:
import os
from blockchain_parser.blockchain import Blockchain
blockchain = Blockchain(os.path.expanduser('~/.bitcoin/blocks'))
for block in blockchain.get_ordered_blocks(os.path.expanduser('~/.bitcoin/blocks/index'), end=1000):
print("height=%d block=%s" % (block.height, block.hash))
And the error I'm getting is as follows:
'name' arg must be a byte string or a unicode string
Is there a way to use s3fs or a different S3-to-filesystem method so that the S3 paths can be used as directories and the parser works as intended?
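One possible workaround (not from the thread; the bucket name, prefix, and local directory below are hypothetical) is to copy the blk*.dat files from S3 to a local directory with boto3 and point the parser at that copy, since Blockchain expects a real filesystem path:
import os
import boto3
from blockchain_parser.blockchain import Blockchain

s3 = boto3.client("s3")
bucket, prefix, local_dir = "my-node-bucket", "bitcoin/blocks/", "/tmp/blocks"
os.makedirs(local_dir, exist_ok=True)

# Download every blk*.dat file under the prefix to the local directory.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith(".dat"):
            s3.download_file(bucket, key, os.path.join(local_dir, os.path.basename(key)))

blockchain = Blockchain(local_dir)
for block in blockchain.get_unordered_blocks():
    print("block=%s" % block.hash)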

pyspark dataframe writing csv files twice in s3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the issue is that it is written twice (once with the actual data and once as an empty file). I have checked the dataframe by printing it and the data is fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema=op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition doesn't necessarily mean that it will result in one file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.
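A minimal sketch of that combination, reusing the names from the question (src_bucket_name, row.brand, fileN) and keeping the default maxRecordsPerFile of 0:
# Keep the default so one partition maps to at most one output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)

(df.coalesce(1)
   .write.option("header", "true")
   .csv("s3://" + src_bucket_name + "/src/output/" + row.brand + "/" + fileN))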

How to use spark toLocalIterator to write a single file in local file system from cluster

I have a PySpark job which writes my resultant dataframe to the local filesystem. Currently it is running in local mode, so I am doing coalesce(1) to get a single file, as below:
file_format = 'avro' # will be dynamic and so it will be like avro, json, csv, etc
df.coalesce(1).write.format(file_format).save('file:///pyspark_data/output')
But I see a lot of memory issues (OOM) and it takes longer as well. So I want to run this job with master as yarn and deploy mode as client. To write the result df into a single file on the local filesystem, I need to use toLocalIterator, which yields Rows. How can I stream these Rows into a file of the required format (json/avro/csv/parquet and so on)?
file_format = 'avro'
for row in df.toLocalIterator():
    # write the data into a single file
    pass
You get the OOM error because you try to retrieve all the data into a single partition with coalesce(1).
I don't recommend using toLocalIterator because you will have to re-write a custom writer for every format and you won't have parallel writing.
Your first solution is a good one:
df.write.format(file_format).save('file:///pyspark_data/output')
If you use Hadoop, you can merge all the output parts into one file on the filesystem this way (it works for CSV; you can try it for other formats):
hadoop fs -getmerge <HDFS src> <FS destination>
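For example, a minimal sketch of that two-step flow with illustrative paths (per the answer, getmerge is known to work for CSV output):
# Parallel write of the partitioned output (no coalesce), e.g. to HDFS.
df.write.format("csv").save("/pyspark_data/output")

# Then, from the command line, merge the part files into a single local file:
#   hadoop fs -getmerge /pyspark_data/output /tmp/pyspark_data/output.csv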

ORC format - PIG - dependent libraries

I understand that to write in ORC format with Snappy compression from a Pig script, I should use OrcStorage('-c SNAPPY').
I need your help: what is the SET command or necessary library I need to include to enable storing the result dataset in ORC format?
Please help.
Subra
Check which Pig version you are using.
ORC storage is available from Pig 0.14 as a built-in function.
Check the examples:
https://pig.apache.org/docs/r0.14.0/func.html#OrcStorage
UPDATE
This Pig script just works fine:
data = LOAD 'SO/date.txt' USING PigStorage(' ') AS (ts:chararray);
STORE data INTO 'orc/snappy' using OrcStorage('-c SNAPPY');
data_orc = LOAD 'orc/snappy' using OrcStorage('-c SNAPPY');
DUMP data_orc;
You don't even need to REGISTER the kryo jar: since it is not used directly from the Pig script, the register would be optimized out. It is used via reflection, so you have to add the kryo jar to the classpath instead, like this:
pig -latest -useHCatalog -cp ./kryo-2.24.0.jar orc.pig