I have a use case where I have to transfer a million or more files into HDFS. File sizes vary from 10 KB to 50 KB.
I am using a spool dir source, an HDFS sink and a file channel.
I am also using the BLOB deserializer because I do not want to break up my source data: each complete file should be transferred as a single event, which I am able to achieve.
So far my Flume agent design looks like this: spool dir source -> file channel -> HDFS sink.
Still, I am not able to get good performance.
I also want to understand whether the Hadoop cluster configuration can help improve the performance.
AFAIK, there is no silver bullet for performance tuning. As usual, you will need to experiment and learn based on your data and infrastructure. The following articles discuss the various knobs (and general guidance) available to fine tune Flume performance:
Cloudera - Flume Performance Tuning, DZone - Flume Performance Tuning
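To make those knobs concrete, here is a minimal sketch of an agent definition for a spooldir -> file channel -> HDFS flow of this kind. The agent name, directories, HDFS path and all of the numbers are assumptions to be tuned against your own data and cluster, not recommended values:

```properties
# Minimal sketch only; names, paths and sizes are placeholders.
agent.sources  = spool-src
agent.channels = file-ch
agent.sinks    = hdfs-sink

agent.sources.spool-src.type = spooldir
agent.sources.spool-src.spoolDir = /data/incoming
agent.sources.spool-src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.spool-src.batchSize = 1000
agent.sources.spool-src.channels = file-ch

agent.channels.file-ch.type = file
agent.channels.file-ch.checkpointDir = /flume/checkpoint
agent.channels.file-ch.dataDirs = /flume/data
agent.channels.file-ch.capacity = 1000000
agent.channels.file-ch.transactionCapacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = file-ch
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/ingest
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.batchSize = 1000
agent.sinks.hdfs-sink.hdfs.rollCount = 10000
agent.sinks.hdfs-sink.hdfs.rollSize = 0
agent.sinks.hdfs-sink.hdfs.rollInterval = 300
agent.sinks.hdfs-sink.hdfs.threadsPoolSize = 20
```

The general ideas are to batch events through the source and sink (batchSize, with the channel's transactionCapacity at least as large), to give the HDFS sink more threads, and possibly to run several HDFS sinks off the same file channel for extra parallelism. Note that rollCount / rollSize / rollInterval control how many events (i.e. source files) end up in one HDFS file: batching many small events into fewer, larger files is usually much faster and easier on the NameNode, but if you need exactly one HDFS file per source file you would set rollCount = 1 and accept the cost. On the Hadoop side, a very large number of small files mostly pressures the NameNode, which is one more reason to let the sink write bigger files.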
What are the advantages and disadvantages of having HDFS as a staging area for big data of 20 TB?
Which is the best persistent staging layer: can we use HDFS, or should we opt for S3 in the cloud? Kindly share your expertise.
My findings:
HDFS has been designed to store massive amounts of data and to support batch-mode (OLAP) workloads, whereas Cassandra was designed for online transactional (OLTP) use cases.
The current recommendation for server density is 1 TB/node for spinning disk and 3 TB/node when using SSD. In the Cassandra 3.x series, the storage engine has been rewritten to improve node density, and there are a few JIRA tickets aimed at improving server density further in the future.
There is currently a limit on server density in Cassandra because of:
repair. With an eventually consistent database, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely, to compute the Merkle tree, a binary tree of digests). The issue of repair is mostly solved by the incremental repair introduced in Cassandra 2.1.
compaction. With an LSM-tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of deprecated or deleted data. The more data you have on one node, the longer compaction takes. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy, which has tuning knobs to stop compacting data after a time threshold. A few people are using DateTiered compaction in production with densities up to 10 TB/node.
node rebuild. If one node crashes and is completely lost, you'll need to rebuild it by streaming data from the other replicas. The higher the node density, the longer it takes to rebuild the node.
load distribution. The more data you have on a node, the greater the load average (high disk I/O and high CPU usage). This greatly impacts node latency for real-time requests. Whereas a difference of 100 ms is negligible for a batch scenario that takes 10 hours to complete, it is critical for a real-time database/application subject to a tight SLA.
Doubt: is S3 better, or HDFS?
Firstly, I think you are confusing Cassandra with an HDFS-like system, which is wrong. Also, I don't think you should be comparing Cassandra and HDFS at all; they have exactly opposite use cases.
Cassandra is used when you have a high throughput of writes and reads are limited. It is very difficult to run map-reduce operations on Cassandra, as you are limited by the partition and clustering keys.
HDFS is mainly used for map-reduce jobs, where you upload files in a pre-defined format and want to run analytical queries on any column, which may or may not be a partitioning key.
Doubt: is S3 better, or HDFS?
S3 is essentially cloud-hosted storage that plays the same role as HDFS, so I am assuming the question is whether cloud storage or a local HDFS is better. It depends on your use case, but S3 gives you the advantage of almost infinite scalability. If your data is present in S3, you can use AWS EMR to run your map-reduce jobs, and it comes with a high level of monitoring. These things are difficult to do if you are running a local HDFS.
Here is a good tutorial that you should read.
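To illustrate the EMR point above, a Hadoop MapReduce driver can read its input straight from S3 simply by using s3:// paths. The bucket names below and the commented-out mapper/reducer classes are placeholders, not anything from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3JobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "analytics-on-s3");
        job.setJarByClass(S3JobDriver.class);

        // Hypothetical mapper/reducer classes; plug your own job logic in here.
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // On EMR the s3:// scheme is resolved by EMRFS, so the job reads its
        // input splits straight from the bucket instead of from HDFS.
        FileInputFormat.addInputPath(job, new Path("s3://my-input-bucket/claims/"));
        FileOutputFormat.setOutputPath(job, new Path("s3://my-output-bucket/results/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```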
In my system, a user can upload very large files, which I need to store in Couchbase. I don't need such very large objects to persist in memory, but I do want them always read from and written to disk. These files are read-only (never modified): the user can upload them, delete them and download them, but never update them. Because of some technical constraints, my system cannot store those files in the file system, so they have to be stored in the database.
I've done some research and found an article[1] saying that storing large objects in a database is generally a bad idea, especially with Couchbase, but it also gives some advice: create a secondary bucket with a low RAM quota and tune the value/full eviction policy. My concern is the 20 MB limit mentioned by the author; my files would be much larger than that.
What's the best approach for storing large files in Couchbase without having them persist in memory? Is it possible to raise the 20 MB limit if needed? Should I create a secondary bucket with a very low RAM quota and a full eviction policy?
[1] http://blog.couchbase.com/2016/january/large-objects-in-a-database
Generally, Couchbase engineers recommend that you not store large files in Couchbase itself. Instead, store the files on some file or blob store (like AWS S3 or Azure Blob Storage) and keep the metadata about the files in Couchbase.
There's a Couchbase blog post that gives a pretty detailed breakdown of how to do what you want to do in Couchbase.
It is Java-API specific, but the general approach can work with any of the Couchbase SDKs; I'm actually in the midst of doing something pretty similar right now with the Node SDK.
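As a rough sketch of that "metadata in Couchbase, bytes in a blob store" idea using the Couchbase Java SDK 3.x (the connection details, bucket name, document ID format and blobUrl field are all assumptions for the example, not something prescribed by the blog post):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;

public class FileMetadataStore {
    public static void main(String[] args) {
        // Hypothetical connection details.
        Cluster cluster = Cluster.connect("couchbase://127.0.0.1", "user", "password");
        Bucket bucket = cluster.bucket("file-metadata");
        Collection collection = bucket.defaultCollection();

        // The binary itself lives in an external blob store (e.g. S3);
        // Couchbase only keeps a small document describing it.
        JsonObject meta = JsonObject.create()
                .put("fileName", "report.pdf")
                .put("sizeBytes", 48123904L)
                .put("contentType", "application/pdf")
                .put("blobUrl", "s3://my-bucket/files/report.pdf") // assumed layout
                .put("sha1", "0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33");

        collection.upsert("file::report.pdf", meta);
        cluster.disconnect();
    }
}
```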
I can't speak for what Couchbase engineers recommend, but they've posted this blog entry detailing how to do it.
For large files, you'll certainly want to split them into chunks; do not attempt to store a big file in a single document. The approach I'm looking at is to chunk the data and insert it under the file's SHA-1 hash. So the file "Foo.docx" would be split into, say, 4 chunks, stored as "sha1|0", "sha1|1" and so on, where sha1 is the hash of the document. This also enables a setup where you can store the same file under many different names.
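Here is a minimal, hedged sketch of that chunking scheme with the Couchbase Java SDK 3.x. The 1 MB chunk size, bucket name and the extra metadata document recording the chunk count are my own assumptions; the key format is the "sha1|index" scheme described above:

```java
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.codec.RawBinaryTranscoder;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.UpsertOptions;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;

public class ChunkedFileWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and bucket name.
        Cluster cluster = Cluster.connect("couchbase://127.0.0.1", "user", "password");
        Collection files = cluster.bucket("files").defaultCollection();

        byte[] data = Files.readAllBytes(Paths.get("Foo.docx"));

        // SHA-1 of the whole file, used as the common prefix for all chunk keys.
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        String sha1 = sb.toString();

        int chunkSize = 1 << 20; // 1 MB per chunk (an assumption, tune as needed)
        int chunkCount = (data.length + chunkSize - 1) / chunkSize;

        // Store each chunk as a raw binary document keyed "<sha1>|<index>".
        for (int i = 0; i < chunkCount; i++) {
            byte[] chunk = Arrays.copyOfRange(data, i * chunkSize,
                    Math.min(data.length, (i + 1) * chunkSize));
            files.upsert(sha1 + "|" + i, chunk,
                    UpsertOptions.upsertOptions().transcoder(RawBinaryTranscoder.INSTANCE));
        }

        // A small metadata document maps the user-visible name to the chunks.
        files.upsert("file::Foo.docx", JsonObject.create()
                .put("sha1", sha1)
                .put("chunkCount", chunkCount)
                .put("sizeBytes", data.length));

        cluster.disconnect();
    }
}
```

Reading the file back is the reverse: fetch the metadata document, then get chunks 0 through chunkCount - 1 with the same transcoder and concatenate them in order.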
Tradeoffs: if integration with Amazon S3 is an option for you, you might be better off with that. In general, chunking data in a DB as described here is going to be more complicated to implement, and much slower, than using something like Amazon S3. But that has to be traded off against other requirements, such as whether you can keep sensitive files in S3, or whether you want to deal with maintaining a filesystem and the associated scaling.
So it depends on what your requirements are. If you want speed/performance, don't put your files in Couchbase; but can you do it? Sure. I've done it myself, and the blog post above describes another way to do it.
There are all kinds of interesting extensions you might wish to implement, depending on your needs. For example, if you commonly store many different files with similar content, you might implement a blocking strategy that allows common segments to be stored only once, to save space. Other solutions like S3 will happily store copies of copies of copies of copies, and gleefully charge you huge amounts of money to do so.
EDIT: as a follow-up, there's another Couchbase post discussing why storing files in the DB might not be a good idea. It raises reasonable things to consider, but again it depends on your application-specific requirements. "Use S3" is generally good advice, I think, but it won't work for everyone.
MongoDB has an option to do this sort of thing, and it's supported in almost all drivers: GridFS. You could do something like GridFS in Couchbase: make a metadata collection (bucket) and a chunk collection with fixed-size blobs. GridFS allows you to change the blob size per file, but all blobs within a file must be the same size, and the file size is stored in the metadata. A typical chunk size is 2048, and sizes are restricted to powers of 2.
You don't need a memory cache for files; you can queue up the chunks for download in your app server. You may want to try GridFS on MongoDB first and then see whether you can adapt it to Couchbase, but there is always this: https://github.com/couchbaselabs/cbfs
This is the best practice: do not treat the Couchbase database as your main database; consider it a sync database. No matter how you chunk the data into small pieces, it will eventually grow beyond the 20 MB size limit, which will hit you in the long run. Having a strong database like MySQL in the middle will help you store that large data; then use Couchbase for real-time access and sync only.
Does anybody have insights into the internal workings of NativeS3FileSystem with different InputFormats on Amazon EMR, as compared to normal Hadoop HDFS, i.e. input-split calculation and the actual data flow? What are the best practices and points to consider when using Amazon EMR with S3?
Thanks,
What's important is that if you're planning to use S3N instead of HDFS, you should know that you will lose the benefits of data locality, which can have a significant impact on your jobs.
In general, when using S3N you have two choices for your job flows:
Stream data from S3 as a replacement for HDFS: this is useful if you need constant access to your whole dataset, but as explained above, there can be some performance constraints.
Copy your data from S3 to HDFS: if you only need access to a small sample of your data at some point in time, you should just copy it to HDFS to retain the benefit of data locality (a rough sketch of this follows below).
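As a sketch of the second option, the copy can be done with the plain Hadoop FileSystem API; in practice you would more likely use distcp or S3DistCp on EMR. The bucket and paths are placeholders, and the S3 credentials are assumed to be configured already in the Hadoop configuration:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class S3ToHdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical locations: a dataset sitting in S3 and a staging dir in HDFS.
        Path src = new Path("s3n://my-bucket/dataset/");
        Path dst = new Path("hdfs:///staging/dataset/");

        FileSystem srcFs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
        FileSystem dstFs = FileSystem.get(URI.create("hdfs:///"), conf);

        // Copy once up front so the MapReduce job that follows gets data locality.
        FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource */ false, conf);
    }
}
```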
From my experience, I have also noticed that for large jobs the split calculation can become quite heavy; I've even seen cases where the CPU was at 100% just calculating input splits. I believe the reason is that the Hadoop FileSystem layer tries to get the size of each file separately, which for files stored in S3N means an API call per file, so a big job with many input files can spend a lot of time just there.
For more information, I would advise taking a look at the following article, where someone asked a similar question on the Amazon forums.
Distributed file systems like the Google File System and Hadoop's HDFS don't support random I/O.
(A file that has already been written cannot be modified; only writing and appending are possible.)
Why did they design the file system like this?
What are the important advantages of this design?
P.S. I know Hadoop will support modifying data that has already been written, but they said its performance will be very poor. Why?
Hadoop distributes and replicates files. Since the files are replicated, any write operation would have to find each replicated section across the network and update the file, which heavily increases the time of the operation. Updating the file could also push it over the block size and require the file to be split into two blocks, with the second block then replicated. I don't know the internals and when/how a block would be split... but it's a potential complication.
What if a job that has already performed an update fails or gets killed and is re-run? It could end up updating the file multiple times.
The advantage of not updating files in a distributed system is that you don't have to know who else is using the file when you update it, or where all the pieces are stored. There are also potential timeouts (a node holding a block is unresponsive), so you might end up with mismatched data (again, I don't know the internals of Hadoop, and an update with a node down might be handled; this is just something I'm brainstorming).
There are a lot of potential issues (a few laid out above) with updating files on HDFS. None of them are insurmountable, but checking for and handling them would require a performance hit.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
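To make the write model concrete: the HDFS FileSystem API exposes create and append, but nothing that lets you overwrite bytes in the middle of an existing file. A minimal sketch (the path is a placeholder, and append has to be enabled on the cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events.log"); // placeholder path

        // Writing a new file: the only way to "change" existing content is to
        // rewrite the whole file (or write a new one and swap it in).
        try (FSDataOutputStream out = fs.create(file, /* overwrite */ true)) {
            out.writeBytes("first record\n");
        }

        // Appending to the end is supported (when append is enabled on the
        // cluster), but there is no API to seek back and modify bytes that
        // have already been written.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended record\n");
        }
    }
}
```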
I think it's because of the block size of the data, and the whole idea of Hadoop is that you don't move the data around; instead, you move the algorithm to the data.
Hadoop is designed for non-real-time batch processing of data. If you're looking for something closer to a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.
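For contrast, here is a minimal sketch of the row-level random read/write that HBase provides on top of HDFS; the table, column family and row key are made up for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("claims"))) {

            // Update a single row in place (something plain HDFS files cannot do).
            Put put = new Put(Bytes.toBytes("claim-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("PAID"));
            table.put(put);

            // Random read of that same row by key.
            Result row = table.get(new Get(Bytes.toBytes("claim-42")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"))));
        }
    }
}
```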
I've been following Hadoop for a while, and it seems like a great technology. Map/Reduce and clustering are just good stuff. But I haven't found any article about using Hadoop with SQL Server.
Let's say I have a huge claims table (600 million rows) and I want to take advantage of Hadoop. I was thinking, and correct me if I'm wrong, that I could query my table, extract all of my data, and load it into Hadoop in chunks of some format (XML, JSON, CSV). Then I could take advantage of Map/Reduce and clustering with at least six machines and leave my SQL Server free for other tasks. I'm just throwing a bone here; I just want to know if anybody has done such a thing.
Importing and exporting data to and from a relational database is a very common use-case for Hadoop. Take a look at Cloudera's Sqoop utility, which will aid you in this process:
http://incubator.apache.org/projects/sqoop.html
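For a rough idea of what that looks like in practice, a Sqoop import from SQL Server into HDFS is a single command. The host, database, credentials, table and target directory below are placeholders, and the SQL Server JDBC driver has to be on Sqoop's classpath:

```
sqoop import \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=insurance" \
  --username etl_user -P \
  --table claims \
  --target-dir /data/claims \
  --num-mappers 8
```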