Amazon EMR NativeS3FileSystem internals query - amazon-s3

Does anybody have insights into the internal workings of NativeS3FileSystem with different InputFormats on Amazon EMR, as compared to normal Hadoop HDFS, i.e. input split calculation and the actual data flow? What are the best practices and points to consider when using Amazon EMR with S3?
Thanks,

The important point is that if you plan to use S3N instead of HDFS, you will lose the benefits of data locality, which can have a significant impact on your jobs.
In general, when using S3N you have two choices for your job flows:
Stream data from S3 as a replacement for HDFS: this is useful if you need constant access to your whole dataset, but as explained above there can be some performance constraints.
Copy your data from S3 to HDFS: if you only need access to a small sample of your data at some point in time, just copy it to HDFS to retain the benefit of data locality.
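If you go with the second option on EMR, one common way to do the copy is to submit an s3-dist-cp step to the cluster. Below is a minimal sketch using boto3; the cluster ID and the bucket/paths are placeholders, not values from this thread:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Submit an s3-dist-cp step to an already-running cluster (placeholder ID).
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[
            {
                "Name": "Copy input from S3 to HDFS",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "s3-dist-cp",
                        "--src", "s3://my-bucket/input/",   # placeholder source
                        "--dest", "hdfs:///input/",         # copy into cluster-local HDFS
                    ],
                },
            }
        ],
    )
    print(response["StepIds"])

Once the copy step finishes, the rest of the job flow can read from hdfs:///input/ and benefit from data locality.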
From my experience I have also noticed that for large jobs, split calculation can become quite heavy; I've even seen cases where the CPU sat at 100% just calculating input splits. The reason, I believe, is that the Hadoop FileSystem layer fetches the size of each file separately, which for files stored in S3N means an API call per file, so if a job has many input files that is where the time can go.
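To get a feel for how bad this can be for a given input, you can count the objects under your input prefix, since each file roughly translates into an extra metadata call during split calculation. A quick sketch with boto3 (bucket and prefix are placeholders):

    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    count = 0
    total_bytes = 0
    # Each listed object is one more file the input format has to stat and split.
    for page in paginator.paginate(Bucket="my-bucket", Prefix="input/"):
        for obj in page.get("Contents", []):
            count += 1
            total_bytes += obj["Size"]

    print(f"{count} input files, {total_bytes / 1e9:.2f} GB total")

Tens of thousands of small files is usually a sign that you should consolidate them before running the job.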
For more information, take a look at the following thread, where someone asked a similar question on the Amazon forums.

Related

Is Redshift really required when QuickSight can query directly from S3 using Athena?

We have data dumped into S3 buckets and we use that data to build reports in QuickSight: some reports access S3 directly as the data source, and for others we use Athena to query S3.
At what point does one need to use Redshift? Is there any advantage to using Redshift over S3 + Athena?
No, you might be perfectly fine with just QuickSight, Athena and S3; it will also be relatively cheaper if you keep Redshift out of the equation. Athena is based on Presto and is fairly comprehensive in terms of functionality for most SQL reporting needs.
You would need Redshift if you approach or hit QuickSight's SPICE limits and still want your reports to be snappy and load quickly. From a data engineering side, if you need to update existing records it is easier to micro-batch and update records in Redshift. With Athena/S3 you also need to take care of optimising the storage format yourself (use ORC/Parquet file formats, use partitions, avoid lots of small files, etc.); it is not rocket science, but some people prefer paying for Redshift and not having to worry about that at all.
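To illustrate the Athena side, here is a minimal sketch that submits a query through boto3. The analytics.page_views table, the dt string partition column and the results bucket are made-up names; the date filter only prunes data if the table really is partitioned on dt:

    import boto3

    athena = boto3.client("athena")

    # Hypothetical table backed by partitioned Parquet files in S3.
    response = athena.start_query_execution(
        QueryString=(
            "SELECT dt, count(*) AS views "
            "FROM analytics.page_views "
            "WHERE dt BETWEEN '2023-01-01' AND '2023-01-31' "
            "GROUP BY dt"
        ),
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
    )
    print("query id:", response["QueryExecutionId"])

Because only one month of partitions is scanned, both the query time and the per-TB-scanned cost stay low.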
In the end, Redshift will probably scale better when your data grows very large (into the petabyte range). My suggestion, however, would be to keep using Athena, follow its best practices, and only move to Redshift if you anticipate huge growth and want to be sure you can scale the underlying engine on demand (and, of course, are willing to pay extra for it).

cloud vs HDFS for Big Data staging area

What are the advantages and disadvantages of using HDFS as a staging area for 20 TB of big data?
Which is the best persistent staging layer: can we use HDFS, or should we opt for S3 in the cloud? Kindly share your expertise.
My findings:
HDFS has been designed to store massive amounts of data and to support batch (OLAP) workloads, whereas Cassandra was designed for online transactional (OLTP) use cases.
The current recommendation for server density is 1 TB/node for spinning disks and 3 TB/node when using SSDs. In the Cassandra 3.x series the storage engine has been rewritten to improve node density, and there are a few JIRA tickets aimed at improving server density further.
Server density in Cassandra is currently limited by:
Repair: with an eventually consistent database, repair is mandatory to re-sync data after failures. The more data you have on one server, the longer repair takes (more precisely, the longer it takes to compute the Merkle tree, a binary tree of digests). The repair issue is mostly solved by the incremental repair introduced in Cassandra 2.1.
Compaction: with an LSM-tree storage engine, every mutation results in a new write on disk, so compaction is necessary to get rid of outdated or deleted data. The more data you have on one node, the longer compaction takes. There are solutions to address this too, mainly the new DateTieredCompactionStrategy, which has tuning knobs to stop compacting data after a time threshold. A few people run DateTiered compaction in production with densities up to 10 TB/node.
Node rebuild: if one node crashes and is completely lost, you'll need to rebuild it by streaming data from other replicas. The higher the node density, the longer the rebuild takes.
Load distribution: the more data you have on a node, the greater the load average (high disk I/O and high CPU usage). This greatly impacts node latency for real-time requests. Whereas a difference of 100 ms is negligible for a batch job that takes 10 hours to complete, it is critical for a real-time database/application subject to a tight SLA.
My doubt: is S3 better, or HDFS?
Firstly, I think you are confusing Cassandra with an HDFS-like system, which is wrong. Also, I don't think you should be comparing Cassandra and HDFS at all; they target almost opposite use cases.
Cassandra is used when you have a high throughput of writes and reads are more limited. It is very difficult to run map-reduce operations on Cassandra, as you are constrained by partition and clustering keys.
HDFS is mainly used for map-reduce jobs, where you upload files in a pre-defined format and want to run analytical queries on any column, which may or may not be a partitioning key.
My doubt: is S3 better, or HDFS?
S3 is not HDFS; it is Amazon's cloud object store, which Hadoop can use as a filesystem-like storage layer through its S3 connectors. So I am assuming the question is whether cloud storage is better than local HDFS. It depends on your use case, but S3 gives you the advantage of almost infinite scalability. If your data is in S3 you can use AWS EMR to run your map-reduce jobs, and you get a high level of monitoring out of the box. These things are harder to do if you are running a local HDFS cluster.
Here is a good tutorial that you should read.
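As a rough illustration of the EMR-on-S3 point, the sketch below launches a small transient cluster with boto3 and runs a Spark step directly against data sitting in S3. The bucket paths, script name, instance types and release label are all placeholders:

    import boto3

    emr = boto3.client("emr")

    cluster = emr.run_job_flow(
        Name="s3-staging-analytics",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step is done
        },
        Steps=[
            {
                "Name": "aggregate staged data",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "s3://my-bucket/jobs/aggregate.py",   # hypothetical job script
                        "--input", "s3://my-bucket/staging/",
                        "--output", "s3://my-bucket/reports/",
                    ],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("cluster id:", cluster["JobFlowId"])

The staging data never has to be loaded into a long-lived cluster: the compute comes up, reads from S3, writes back to S3 and goes away.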

Storing large objects in Couchbase - best practice?

In my system, a user can upload very large files, which I need to store in Couchbase. I don't need these very large objects to persist in memory, but I do want them always read from and written to disk. The files are read-only (never modified): the user can upload them, delete them and download them, but never update them. Due to some technical constraints, my system cannot store those files in the file system, so they have to be stored in the database.
I've done some research and found an article[1] saying that storing large objects in a database is generally a bad idea, especially with Couchbase, but it also gives some advice: create a secondary bucket with a low RAM quota and tune the value/full eviction policy. My concern is the 20 MB limit mentioned by the author; my files would be much larger than that.
What's the best approach for storing large files in Couchbase without having them persist in memory? Is it possible to raise the 20 MB limit if needed? Should I create a secondary bucket with a very low RAM quota and a full eviction policy?
[1] http://blog.couchbase.com/2016/january/large-objects-in-a-database
Generally, Couchbase engineers recommend that you not store large files in Couchbase. Instead, store the files in some file or blob store (like AWS S3 or Azure Blob Storage) and keep the metadata about the files in Couchbase.
There's a Couchbase blog post that gives a pretty detailed breakdown of how to do what you want to do in Couchbase.
It's specific to the Java API, but the general approach can work with any of the Couchbase SDKs; I'm actually in the midst of doing something pretty similar right now with the Node SDK.
I can't speak for what Couchbase engineers recommend, but they've posted this blog entry detailing how to do it.
For large files, you'll certainly want to split the data into chunks; do not attempt to store a big file all in one document. The approach I'm looking at is to chunk the data and insert each chunk under the file's SHA-1 hash. So a file "Foo.docx" would get split into, say, 4 chunks stored as "sha1|0", "sha1|1" and so on, where sha1 is the hash of the document. This would also enable a setup where you can store the same file under many different names.
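Here is a rough sketch of that chunking scheme in Python. The store(key, value) callable is a stand-in for whatever write call your Couchbase SDK exposes (for example an upsert on a collection, with a binary transcoder for the chunks); the 1 MB chunk size is an arbitrary choice well under the 20 MB value limit:

    import hashlib

    CHUNK_SIZE = 1024 * 1024  # 1 MB per chunk, well under the per-value limit

    def store_file(path, store):
        """Split a file into chunks keyed '<sha1>|<index>' plus a metadata doc."""
        sha1 = hashlib.sha1()
        chunks = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                sha1.update(chunk)
                chunks.append(chunk)

        digest = sha1.hexdigest()
        for i, chunk in enumerate(chunks):
            store(f"{digest}|{i}", chunk)          # binary chunk documents

        # Small JSON metadata document used to find and reassemble the chunks;
        # many filenames can point at the same content hash.
        store(f"meta::{path}", {"sha1": digest, "chunks": len(chunks)})
        return digest

(For truly huge files you would hash in a first pass and stream the chunks in a second, rather than holding them all in memory as this sketch does.)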
Tradeoffs -- if integration with Amazon S3 is an option for you, you might be better off with that. In general, chunking data in a DB as I describe is going to be more complicated to implement, and much slower, than using something like Amazon S3. But that has to be traded off against other requirements, like whether or not you can keep sensitive files in S3, or whether you want to deal with maintaining a filesystem and the associated scaling of that.
So it depends on what your requirements are. If you want speed/performance, don't put your files in Couchbase -- but can you do it? Sure. I've done it myself, and the blog post above describes a way to do it.
There are all kinds of interesting extensions you might wish to implement, depending on your needs. For example, if you commonly store many different files with similar content, you might implement a de-duplication strategy that stores each common segment only once, to save space. Other solutions like S3 will happily store copies of copies of copies of copies, and gleefully charge you huge amounts of money to do so.
EDIT: as a follow-up, there's another Couchbase post talking about why storing files in the DB might not be a good idea. Reasonable things to consider, but again it depends on your application-specific requirements. "Use S3" is, I think, generally good advice, but it won't work for everyone.
MongoDB has an option for this sort of thing, supported in almost all of its drivers: GridFS. You could implement something like GridFS in Couchbase: a metadata collection (bucket) plus a chunk collection of fixed-size blobs. GridFS lets you choose the blob size per file, but all blobs for a given file must be the same size, and the file size is stored in the metadata. A typical chunk size is 2048, and chunk sizes are restricted to powers of 2.
You don't need a memory cache for the files; you can queue up the chunks for download in your app server. You may want to try GridFS on MongoDB first and then see if you can adapt it to Couchbase, but there is always this: https://github.com/couchbaselabs/cbfs
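For reference, this is roughly what the GridFS pattern looks like with pymongo before you try porting it to Couchbase; the connection string, database name and file names are placeholders:

    import gridfs
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
    db = client["filestore"]
    fs = gridfs.GridFS(db)   # backed by the fs.files / fs.chunks collections

    # Store a large file; GridFS splits it into fixed-size chunk documents.
    with open("big_upload.bin", "rb") as f:
        file_id = fs.put(f, filename="big_upload.bin")

    # Stream it back without loading the whole file into memory at once.
    grid_out = fs.get(file_id)
    with open("big_upload.copy", "wb") as out:
        for chunk in grid_out:
            out.write(chunk)

The same two-collection layout (file metadata plus fixed-size chunk blobs) is what you would recreate on the Couchbase side.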
My view of best practice: do not treat Couchbase as your main database for this; consider it a sync database. No matter how you chunk the data into small pieces, documents will eventually grow beyond the 20 MB size, which will hurt you in the long run. Having a robust database like MySQL in the middle to store that large data, and using Couchbase for real-time access and sync only, will help.

Pig does not import data into system before applying queries

I was reading some documentation on Pig Latin and could not fully understand why Pig would not need to import the data into the system before applying queries during data analysis.
Can someone please explain? Thanks.
In Hadoop and HDFS there is a concept called data locality, which means "bring your computation/code to the data", not the data to the computation.
This concept applies to all data processing technologies on top of Hadoop, like MapReduce, Hive and Pig. It is the main reason Pig doesn't import the data into the system; instead it goes to where the data lives and analyses it there.
Data locality: an important concept with HDFS and MapReduce, data locality can best be described as "bringing the compute to the data." In other words, whenever you run a MapReduce program on a particular part of HDFS data, you want to run it on the node, or machine, that actually stores this data in HDFS. Doing so lets processes run much faster, since it avoids having to move large amounts of data around.
When a MapReduce job is submitted, part of what the JobTracker does is look at which machines hold the blocks required for the task. This is also why, when the NameNode splits data files into blocks, each block is replicated three times: the first replica is stored on the machine writing the data, while the second and third are stored on other machines.
Storing the data across three machines thus gives you a much higher chance of achieving data locality, since it's likely that at least one of those machines will be free enough to process the data stored at that location.
Reference: http://www.plottingsuccess.com/hadoop-101-important-terms-explained-0314/

Why doesn't Hadoop file system support random I/O?

Distributed file systems like the Google File System and Hadoop's HDFS don't support random I/O.
(They can't modify files that have already been written; only writing and appending are possible.)
Why did they design the file system like this?
What are the important advantages of this design?
P.S. I know Hadoop will support modifying data that has already been written, but they said its performance will be very poor. Why?
Hadoop distributes and replicates files. Since files are replicated, any write operation would have to find each replicated section across the network and update it, which heavily increases the time for the operation. Updating a file could also push it over the block size, requiring the file to be split into two blocks and the second block to be replicated. I don't know the internals of when/how a block would be split, but it's a potential complication.
What if a job that has already applied an update fails or gets killed and is re-run? It could end up applying the update multiple times.
The advantage of not updating files in a distributed system is that you don't need to know who else is using a file when you update it, or where all of its pieces are stored. There are also potential timeouts (a node holding a block is unresponsive), so you might end up with mismatched data (again, I don't know Hadoop's internals; an update with a node down might be handled; it's just something I'm brainstorming).
There are a lot of potential issues (a few laid out above) with updating files on HDFS. None of them are insurmountable, but checking for and handling them would come at a performance cost.
Since HDFS's main purpose is to store data for use by MapReduce, row-level updates aren't that important at this stage.
I think it's because of the block size of the data, and because the whole idea of Hadoop is that you don't move the data around; instead, you move the algorithm to the data.
Hadoop is designed for non-real-time batch processing of data. If you're looking for something closer to a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.
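To get a feel for the random access HBase gives you on top of Hadoop, here is a minimal sketch using the happybase client (my choice of client, not something mentioned above); the Thrift host, table and row key are placeholders, and it assumes the HBase Thrift server is running and the table already exists with an info column family:

    import happybase

    conn = happybase.Connection("hbase-thrift-host")   # placeholder host
    table = conn.table("user_profiles")                # placeholder table

    # Random-access write: update a single cell in place by row key.
    table.put(b"user#42", {b"info:email": b"someone@example.com"})

    # Random-access read: fetch just that row back.
    row = table.row(b"user#42")
    print(row[b"info:email"])

This kind of row-level read/modify is exactly what plain HDFS files are not designed for.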