How map-reduce works on HDFS vs S3? - amazon-s3

I have been trying to understand how different a map-reduce job is executed on HDFS vs S3. Can someone please address my questions:
Typically HDFS clusters are not only storage oriented, but also contain horsepower to execute MR jobs; and that is why the jobs are mapped on several data nodes and reduced on few. To be exact, the mapping (filter etc) is done on data locally, whereas the reducing (aggregation) is done on common node.
Does this approach work as it is on S3? As far as I understand, S3 is just a data store. Does hadoop has to COPY WHOLE data from S3 and then run Map (filter) and reduce (aggregation) locally? or it follows exactly same approach as HDFS. If the former case is true, running jobs on S3 could be slower than running jobs on HDFS (due to copying overhead).
Please share your thoughts.

Performance of S3 is slower than HDFS, but it provides other features like bucket versioning and elasticity and other data recovery schemes(Netflix uses a Hadoop cluster using S3).
Theoretically, before the split computation, the sizes of input files need to be determined, so hadoop itself has an filesystem implementation on top of S3 which allows higher layers to be agnostic of the source of the data. Map-Reduce calls the generic file listing API against each input directory to get the size of all files in the directory.
Amazons EMR have a special version of the S3 File System that can stream data directly to S3 instead of buffering to intermediate local files this can make it faster on EMR.

If you have a Hadoop cluster in EC2 and you run a MapReduce job over S3 data, yes the data will be streamed into the cluster in order to run the job. As you say, S3 is just a data store, so you can not bring the computation to the data. These non-local reads could cause a bottleneck on processing large jobs, depending on the size of the data and the size of the cluster.

Related

How to copy Big Data from GCS to S3?

How to copy a few terabytes of data from GCS to S3?
There's nice "Transfer" feature in GCS that allows to import data from S3 to GCS. But how to do the export, the other way (besides moving data generation jobs to AWS)?
Q: Why not gsutil?
Yes, gsutil supports s3://, but transfer is limited by that machine network throughput. How to easier do it in parallel?
I tried Dataflow (aka Apache Beam now), that would work fine, because it's easy to parallelize on like a hundred of nodes, but don't see there's simple 'just copy it from here to there' function.
UPDATE: Also, Beam seems to be computing a list of source files on the local machine in a single thread, before starting the pipeline. In my case that takes around 40 minutes. Would be nice to distribute it on the cloud.
UPDATE 2: So far I'm inclined to use two own scripts that would:
Script A: Lists all objects to transfer, and enqueue a transfer task for each one into a PubSub queue.
Script B: Executes these transfer tasks. Runs on cloud (e.g. Kubernetes), many instances in parallel
The drawback is that it's writing a code that may contain bugs etc, not using a built-in solution like GCS "Transfer".
You could use gsutil running on Compute Engine (or EC2) instances (which may have higher network bandwidth available than your local machine).
Using gsutil -m cp will parallelize copying across objects, but individual objects will still be copied sequentially.

Running Spark application using HDFS or S3

In my spark application, I just want to access a big file, and distribute the computation across many nodes on EC2.
Initially, my file is stored on S3.
It's very convenient for me to load the file with sc.textFile() function from S3.
However, I can put some efforts to load the data to HDFS and then read the data from there.
My question is, will the performance be better with HDFS?
My code involves the spark partitions(mapPartitions transforamtion), so does it really matter what is my initial file system?
Obviously when using S3 the latency is higher and the data throughput is lower compared to HDFS on local disk.
But it depends what you do with your data. It seems most of programs are limited more by CPU power than network throughput. So you should be fine with the 1Gbps throughput that you get from S3.
Anyway you can check recent slides from Aaron Davidson's talk on Spark Summit 2015. This topic is discussed there.
http://www.slideshare.net/databricks/spark-summit-eu-2015-lessons-from-300-production-users/16

Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD as in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code should look something like this
import org.apache.spark._
val sqlContext = sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://...", 10e-6)
data.saveAsParquetFile("s3n://...")
I run into problems if I have more than about 2000 partitions or if there is partition larger than 5G.
This puts an upper bound on the maximum size SchemaRDD I can process this way.
The prctical limit is closer to 1T since partitions sizes vary widely and you only need 1 5G partition to have the process fail.
Questions dealing with the specific problems I have encountered are
Multipart uploads to Amazon S3 from Apache Spark
Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
Spark SQL unable to complete writing Parquet data with a large number of shards
This questions is to see if there are any solutions to the main goal that do not necessarily involve solving one the above problems directly.
To distill things down there are 2 problems
Writing a single shard larger than 5G to S3 fails. AFAIK this a built in limit of s3n:// buckets. It should be possible for s3:// buckets but does not seem to work from Spark and hadoop distcp from local HDFS can not do it either.
Writing the summary file tends to fail once there are 1000s of shards. There seem to be multiple issues with this. Writing directly to S3 produces the error in the linked question above. Writing directly to local HDFS produces an OOM error even on an r3.8xlarge (244G ram) once when there about 5000 shards. This seems to be independent of the actual data volume. The summary file seems essential for efficient querying.
Taken together these problems limit Parquet tables on S3 to 25T. In practice it is actually significantly less since shard sizes can vary widely within an RDD and the 5G limit applies to the largest shard.
How can I write a >>25T RDD as Parquet to S3?
I am using Spark-1.1.0.
From AWS S3 documentation:
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from 1 byte to 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
One way to go around this:
Attache an EBS volume to your system, format it.
Copy the files to the "local" EBS volume.
Snapshot the volume, it goes to your S3 automatically.
It also gives a smaller load on your instance.
To access that data, you need to attache the snapshot as an EBS to an instance.

Using Glacier as back end for web crawling

I will be doing a crawl of several million URLs from EC2 over a few months and I am thinking about where I ought to store this data. My eventual goal is to analyze it, but the analysis might not be immediate (even though I would like to crawl it now for other reasons) and I may want to eventually transfer a copy of the data out for storage on a local device I have. I estimate the data will be around 5TB.
My question: I am considering using Glacier for this, with the idea that I will run a multithreaded crawler that stores the crawled pages locally (on EB) and then use a separate thread that combines, compresses, and shuttles that data to Glacier. I know transfer speeds on Glacier are not necessarily good, but since there is no online element of this process, it would seem feasible (esp since I could always increase the size of my local EBS volume in case I'm crawling faster than I can store to Glacier).
Is there a flaw in my approach or can anyone suggest a more cost-effective, reliable way to do this?
Thanks!
Redshift seems more relevant than Glacier. Glacier is all about freeze / thaw and you'll have to move the data prior to doing any analysis.
Redshift is more about adding the data into a large, inexpensive, data warehouse and running queries over it.
Another option is to store the data in EBS and leave it there. When you're done with your crawling take a Snapshot to push the volume into S3 and decomission the volume and EC2 instance. Then when you're ready to do the analysis just create a volume from the snapshot.
The upside of this approach is that it's all file access (no formal data store) which may be easier for you.
Personally, I would probably push the data into Redshift. :-)
--
Chris
If your analysis will not be immediate then you can adopt one of the following 2 approaches
Approach 1) Amazon EC2 crawler -> store in EBS disks - Move them frequently to Amazon S3-> archive regularly to glacier. You can store your last X days data in Amazon S3 and use it for adhoc processing as well.
Approach 2) Amazon EC2 crawler -> store in EBS disks - Move them frequently to Amazon Glacier. Retrieve when needed and do the processing on EMR or other processing tools
If you need frequent analysis:
Approach 3) Amazon EC2 crawler -> store in EBS disks - Move them frequently to Amazon S3-> Analysis through EMR or other tools and store the processed results in S3/DB/MPP and move the raw files to glacier
Approach 4) if your data is structured, then Amazon EC2 crawler -> store in EBS disks and move them to Amazon RedShift and move the raw files to glacier
Additional tips:
If you can retrieve the data again(from source)then you can use ephemeral disks for your crawlers instead of EBS
Amazon has introduced Data pipeline service, check whether it fits your needs on data movement.

Writing single Hadoop map reduce output into multiple S3 objects

I am implementing a Hadoop Map reduce job that needs to create output in multiple S3 objects.
Hadoop itself creates only a single output file (an S3 object) but I need to partition the output into multiple files.
How do I achieve this?
I did this by just writing the output directly from my reducer method to S3, using an S3 toolkit. Since I was running on EC2, this was quick and free.
In general, you want Hadoop to handle your input and output as much as possible, for cleaner mappers and reducers; and, of course, you want to write to S3 at the very end of your pipeline, to let Hadoop's code moving do it's job over HDFS.
In any case, I recommend doing all of your data partitioning, and writing entire output sets to S3 in a final reduce task, one set per S3 file. This puts as little writer logic in your code as possible. This paid off for me because I ended up with a minimal Hadoop S3 toolkit which I used for several task flows.
I needed to write to S3 in my reducer code because the S3/S3n filesystems weren't mature; they might work better now.
Do you also know the MultipleOutputFormat?
It's not related to S3, but in general it allows to write output to multiple files, implementing a given logic.