I'm trying to execute a map-reduce job using Spark 1.6 (spark-1.6.0-bin-hadoop2.4.tgz) that reads input from and writes output to S3.
The reads are working just fine with: sc.textFile("s3n://bucket/path/to/file/file.gz")
However, I'm having a lot of trouble getting the writes to work. I'm using the same bucket for the output files: outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")
When my input is extremely small (< 100 records), this seems to work fine. I see one part-NNNNN file written per partition, with some of those files being 0 bytes and the rest under 1 KB. Spot-checking the non-empty files shows correctly formatted map-reduce output. When I move to a slightly bigger input (~500 records), I see the same number of part-NNNNN files (my number of partitions is constant for these experiments), but each one is empty.
When I was experimenting with much bigger data sets (millions of records), my thought was that I was exceeding some S3 limit, which was causing the problem. However, 500 records (which amounts to ~65 KB zipped) is still a trivially small amount of data that I would think Spark and S3 could handle easily.
I've tried using the S3 Block FileSystem instead of the S3 Native FileSystem as outlined here, but I get the same results. I've turned on logging for my S3 bucket, but I can't find a smoking gun there.
Has anyone else experienced this? Or can anyone give me a clue as to what might be going wrong?
Turns out I was working on this too late at night. This morning, I took a step back and found a bug in my map-reduce which was effectively filtering out all the results.
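For anyone hitting the same symptom: a quick sanity check that would have caught this sooner is to count the RDD before writing it out. A minimal sketch, where the paths and the parse/filter functions are placeholders standing in for my actual job:

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-output-sanity-check")

    # Placeholder logic standing in for the real map-reduce steps.
    def parse_record(line):
        return line.strip()

    def keep_record(record):
        return len(record) > 0

    inputRDD = sc.textFile("s3n://bucket/path/to/file/file.gz")
    outputRDD = inputRDD.map(parse_record).filter(keep_record)

    # If this prints 0, every part-NNNNN file will be empty -- the problem
    # is in the job logic, not in S3 or saveAsTextFile.
    print(outputRDD.count())

    outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")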
You should use coalesce before saveAsTextFile.
From the Spark programming guide:
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
e.g.:
outputRDD.coalesce(100).saveAsTextFile("s3n://bucket/path/to/output/")
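A slightly fuller sketch of what that looks like in a job (the paths and the target of 100 partitions are placeholders; getNumPartitions() lets you confirm the effect):

    # Minimal sketch; paths and the partition count are placeholders.
    outputRDD = sc.textFile("s3n://bucket/path/to/input/") \
                  .map(lambda line: line.strip())

    print(outputRDD.getNumPartitions())    # roughly one partition per input split

    smaller = outputRDD.coalesce(100)      # merge partitions without a full shuffle
    print(smaller.getNumPartitions())      # now at most 100 part-NNNNN output files

    smaller.saveAsTextFile("s3n://bucket/path/to/output/")

Note that coalesce only merges existing partitions; if you need to increase the partition count or rebalance data evenly, repartition() is the shuffling counterpart.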
Related
I have encountered very slow performance when copying data from one project to another project in the same data location in BigQuery. It took up to 2 minutes to move only about 100,000 records, whereas other copy operations we have done in BigQuery on hundreds of millions of records took only a few seconds. I would like to find out why this unusually slow movement occurred for such a small data set. Has anyone come across a similar issue, and do you have any idea what could be causing it?
Thanks and best regards.
The cause of the slow copy could be the way your source table was created; for example, it could have been built up by several import jobs, which can cause this kind of fragmentation.
So the difference in time is not due to the amount of data stored in your table, but to the way the data is fragmented inside it.
Although the running time is very reasonable, if you want to speed it up further you can try to coalesce/merge your table. One way of doing this is to export the table to Google Cloud Storage and re-import it (overwriting rather than appending). This should reduce the fragmentation and help if you want to optimize your operations and gain a few seconds.
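As a rough illustration, the export/re-import round trip could look like this with the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are placeholders, and this is a sketch of the idea rather than an official procedure:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")        # placeholder project

    table_id = "my-project.my_dataset.my_table"           # placeholder table
    gcs_uri = "gs://my-bucket/defrag/my_table-*.avro"     # placeholder bucket

    # 1. Export (extract) the table to Google Cloud Storage.
    extract_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.AVRO)
    client.extract_table(table_id, gcs_uri, job_config=extract_config).result()

    # 2. Load it back with WRITE_TRUNCATE (overwrite, not append), which
    #    rewrites the table in fewer, less fragmented pieces.
    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
    client.load_table_from_uri(gcs_uri, table_id, job_config=load_config).result()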
A running time of a few minutes is considered internally to be absolutely normal for a table copy job, and this does not classify as a BigQuery deficiency.
Refer to the official documentation. And if you want to know more about fragmentation in BigQuery, I recommend the O'Reilly book "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale".
I hope you find the above information useful.
I have been using the SelectHiveQL processor to fetch data from Hive and create CSV files. For around 7 million records, it takes around 5 minutes. On closer observation, I found that the data fetch from Hive is fast and takes less than 10% of the overall time, but writing the CSV files takes far too long. I am using 8 cores and 32 GB RAM, and I have configured a 16 GB heap. Can someone please help me improve this performance? Do I need to change any system-level settings?
The CSV output option of SelectHiveQL could certainly be improved: currently it builds each row as a string in memory and then writes it to the flow file, but it could probably write straight to the flow file instead. Please feel free to file a Jira for this improvement.
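As a rough, generic illustration of the design point (plain Python here, not NiFi's actual Java code): accumulating every row into one in-memory string before writing keeps the whole result set in the heap, whereas streaming each row to the output as it is produced keeps memory usage flat.

    import csv

    # Stand-in for the rows coming back from Hive.
    rows = (("id-%d" % i, i * 2) for i in range(7000000))

    # Heavy pattern (roughly the behaviour described above): build one big
    # string in memory, then write it out in a single call.
    #   payload = "\n".join(",".join(str(v) for v in row) for row in rows)
    #   out.write(payload)

    # Streaming pattern: write each row as it is produced.
    with open("output.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for row in rows:
            writer.writerow(row)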
I have millions of S3 files whose sizes average about 250 KB but are highly variable (with a few as large as 4 GB). I can't easily use wildcards to pick out multiple files, but I can build an RDD holding the S3 URLs of the files I want to process at any time.
I'd like to get two kinds of paired RDDs. The first would have the S3 URL as the key and the contents of the file as a Unicode string. (Is that even possible when some of the files can be so long?) The second could be computed from the first by split()-ting the long string at newlines.
I've tried a number of ways to do this, typically getting a Python PicklingError, unless I iterate through the S3 URLs one at a time. Then I can use union() to build up the big pair RDDs I want, as was described in another question. But I don't think that is going to run in parallel, which will be important when dealing with lots of files.
I'm currently using Python, but can switch to Scala or Java if needed.
Thanks in advance.
The size of the files shouldn't matter as long as your cluster has the in-memory capacity. Generally, you'll need to do some tuning before everything works.
I'm not well versed in Python, so I can't comment much on the pickling error. Perhaps these links might help, but I'll add the python tag so that someone more knowledgeable can take a look:
cloudpickle.py
pyspark serializer can't handle functions
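One pattern that sidesteps most of the pickling problems is to create the S3 client inside mapPartitions, so that only plain data (the URLs) gets shipped to the executors. A rough sketch, assuming boto3 is available on the workers; the bucket names and URLs are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-url-fetch")

    urls = ["s3://bucket/key/one.txt", "s3://bucket/key/two.txt"]   # placeholder URLs
    url_rdd = sc.parallelize(urls, 100)

    def fetch_partition(url_iter):
        # Imported here so nothing unpicklable is captured by the closure.
        import boto3
        from urllib.parse import urlparse   # the "urlparse" module on Python 2
        s3 = boto3.client("s3")
        for url in url_iter:
            parsed = urlparse(url)
            body = s3.get_object(Bucket=parsed.netloc,
                                 Key=parsed.path.lstrip("/"))["Body"].read()
            yield (url, body.decode("utf-8"))

    # First pair RDD: (url, whole file contents as a unicode string)
    contents = url_rdd.mapPartitions(fetch_partition)

    # Second pair RDD: (url, single line), derived from the first
    lines = contents.flatMapValues(lambda text: text.split("\n"))

The multi-GB objects will still each land in a single task's memory as one string, so those may need separate handling.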
Does anybody have insight into the internal workings of NativeS3FileSystem with different InputFormats in the Amazon EMR case, as compared to normal Hadoop HDFS, i.e. input split calculation and the actual data flow? What are the best practices and points to consider when using Amazon EMR with S3?
Thanks,
What's important is that if you're planning to use S3N instead of HDFS, you should know that it means you will lose the benefits of data locality, which can have a significant impact on your jobs.
In general when using S3N you have 2 choices for your jobflows:
Stream data from S3 as a replacement for HDFS: this is useful if you need constant access to your whole dataset, but, as explained, there can be some performance constraints.
Copy your data from S3 to HDFS: if you only need access to a small sample of your data at some point in time, you should just copy it to HDFS to retain the benefit of data locality (see the sketch after this list).
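For the second option, the copy itself can be done with a dedicated tool (EMR provides S3DistCp for exactly this), or even with a trivial Spark job. A minimal sketch of the latter, with placeholder paths:

    # Minimal sketch; the bucket and HDFS paths are placeholders.
    raw = sc.textFile("s3n://bucket/path/to/input/")

    # One-time copy onto the cluster's HDFS...
    raw.saveAsTextFile("hdfs:///data/input-copy/")

    # ...then subsequent jobs read from HDFS and keep data locality.
    local = sc.textFile("hdfs:///data/input-copy/")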
From my experience I also noticed that for large jobs, split calculation can become quite heavy, and I've even seen cases where the CPU was at 100% just calculating input splits. The reason, I think, is that the Hadoop FileSystem layer tries to get the size of each file separately, which for files stored in S3N means an API call per file; so if you have a big job with many input files, that's where the time can be spent.
For more information, I would advise taking a look at the following article, where someone asked a similar question on the Amazon forums.
Distributed file systems like the Google File System and Hadoop's HDFS don't support random I/O.
(They can't modify a file that was written before; only writing and appending are possible.)
Why did they design the file systems like this?
What are the important advantages of this design?
P.S. I know Hadoop will support modifying data that has already been written, but they say its performance will be very poor. Why?
Hadoop distributes and replicates files. Since the files are replicated, any write operation is going to have to find each replicated section across the network and update the file, which heavily increases the time for the operation. Updating the file could also push it over the block size and require the file to be split into two blocks, with the second block then being replicated. I don't know the internals of when/how it would split a block... but it's a potential complication.
What if a job that has already done an update fails or gets killed and is re-run? It could update the file multiple times.
Another point in favor of not updating files in a distributed system is that, when you update a file, you don't know who else is using it, and you don't know where its pieces are stored. There are also potential timeouts (the node holding a block is unresponsive), so you might end up with mismatched data (again, I don't know the internals of Hadoop, and an update with a node down might be handled; this is just something I'm brainstorming).
There are a lot of potential issues (a few laid out above) with updating files on HDFS. None of them are insurmountable, but checking and accounting for them would require a performance hit.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
I think it's because of the block size of the data, and because the whole idea of Hadoop is that you don't move data around; instead, you move the algorithm to the data.
Hadoop is designed for non-realtime batch processing of data. If you're looking for something more like a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.