Controlling HDFS replication, number of mappers and reducer identification - apache

I am trying to run Apache Hadoop 2.6.5 in a distributed way (with a cluster of 3 computers) and I want to decide the number of mappers and reducers.
I am using HDFS with a replication factor of 1, and my input is 3 files (tables).
I want to adjust the way data flows in the system, and I would like some help with the following matters. For each one: is it possible, and if so, how and where can I change it?
Replication of HDFS- Can I interfere with the way HDFS replication is done? For example, can I make sure that each file is stored on a different computer? And if so, can I choose which computer it is stored on?
Number of mappers- Can I change the number of mappers or input splits? I know that it is determined by the number of input splits and the block size. I read on the web that I can do this by changing the following parameters, but I don't know where to set them:
-D mapred.map.tasks=5
mapred.min.split.size property
Reducers identification- How can I suggest to, or force, the ResourceManager to start the reduce containers (reduce tasks) on specific computers? And if that is possible, can I choose how many run on each computer (that is, divide the map output differently across the cluster)? More specifically, can I add another parameter to the ContainerLaunchContext (we have Mem, CPU, Disk, and Locality)?

Replication of HDFS- Can I interfere with the way the replication of HDFS has been done?
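Ans- Yes, the replication factor can be changed. Set the dfs.replication property in hdfs-site.xml to change the default for newly written files, or change it for files that already exist with hdfs dfs -setrep. Note that this controls how many copies are kept, not which computer holds them.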
Number of mappers- Can I change the number of mappers or input splits?
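Ans- Yes, but the number of mappers is a property of the MapReduce job, not of HDFS. It equals the number of input splits, so it is changed through the split-size properties (mapred.min.split.size, which is named mapreduce.input.fileinputformat.split.minsize in Hadoop 2). These can be passed with -D on the job command line when the driver uses ToolRunner, or set in the job configuration. mapred.map.tasks is only a hint to the framework.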
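As a rough illustration of why the split-size properties change the mapper count, the sketch below mirrors the rule that, to my understanding, FileInputFormat in the newer mapreduce API applies: split size = max(minsize, min(maxsize, block size)), with the last split of a file allowed to run about 10% over. The function names and the example file sizes are made up for the illustration; it is plain Python, not Hadoop code, and it ignores empty files.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # min_size / max_size stand in for the split.minsize / split.maxsize properties
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size, slop=1.1):
    # Carve off full splits while more than `slop` splits' worth of data remains,
    # then put whatever is left into one final split.
    count = 0
    remaining = file_size
    while remaining / split_size > slop:
        remaining -= split_size
        count += 1
    if remaining > 0:
        count += 1
    return count

def estimate_map_tasks(file_sizes, block_size, min_size=1, max_size=2**63 - 1):
    # One map task per input split, summed over all (non-empty) input files.
    split_size = compute_split_size(block_size, min_size, max_size)
    return sum(count_splits(size, split_size) for size in file_sizes)

# Hypothetical input: three files of 300 MB, 100 MB and 130 MB
files = [300 * MB, 100 * MB, 130 * MB]
default_maps = estimate_map_tasks(files, block_size=128 * MB)
fewer_maps = estimate_map_tasks(files, block_size=128 * MB, min_size=256 * MB)
more_maps = estimate_map_tasks(files, block_size=128 * MB, max_size=64 * MB)
```

With these made-up sizes and a 128 MB block size the estimate is 5 mappers; raising the minimum split size to 256 MB gives 4, and capping the maximum at 64 MB gives 9.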

Related

Flink streaming or batch processing

I am tasked with redesigning an existing catalog processor, and the requirement is as follows. Requirement: I have 5 to 10 vendors (each vendor can have multiple stores) who will provide me with one XML file per store. Basically, there is one products XML file per store and multiple store files per vendor. The maximum file size is 500 MB and the minimum is 100 MB. The average number of products per file could be 100,000.
Sample xml format could be like this ... ... ...
It doesn't take more than 30 minutes to download the file for a store, and these files are updated once per day or every 3 to 6 hours.
Now the priority requirement is that the product details are highly unorganized, and these files have to be organized, processed (10+ processes), converted to another common object (JSON), and then stored in Cassandra.
My technology head advised me to design this with Apache Flink and Kafka on top of HDFS, where Flink directly streams the files from the vendor servers and starts processing them while streaming.
My view was that in either case the files are of finite size, so there is not much need to stream them. So I thought of having a standalone scheduler-cum-downloader to download the files and load them into HDFS. As soon as the files are loaded into HDFS, I can trigger the Flink processing and store the results in Cassandra.
My question is: knowing that the files are of finite size and finite count irrespective of the number of vendors, is stream processing overkill, or would batch processing become a latency burden later?
The answer is highly dependent on the tool you will use. If you go for Flink, I believe that using streaming is fine and won't create problems in the long run. If you write your functions and jobs properly, moving from the DataStream API to the DataSet API would be easy, if needed. Batch here introduces a needless delay and, without further information, doesn't seem the appropriate approach. I believe it would work fine anyway, but it's not clear whether latency is a strict requirement.
That said, I believe Flink in itself is overkill. In this particular use case a more traditional tool like Spark would be a better option in terms of usability, but if you want to invest in Flink, that's totally fine. Given the use case, I don't think you will need any particular library that is present in or integrated with Spark but missing from Flink.

Maintaining logs in Redis cache using Java

Requirement - Our application processes files containing records, and we have to maintain a log for the records in every file. The log file could easily reach 100 MB in size at times.
Solution - Since database operations would be very heavy, we wanted to go for an in-memory cache. We write the logs for a particular file into a Redis key (the key might be the unique file name itself). Later, when the user wants to see the log file, the application should be able to read the contents from the cache using the unique file name as the key and write them into a file which the user can see/download.
Question - Is it a good idea to keep appending the logs for a particular file to the same key and later, when we have to produce the file, read from the key and write the contents to the file? Basically the value of the Redis key would always be a string, and its size might run to 100 MB. Will there be any problems because of this?
You can achieve this with Redis easily, but don't forget that Redis is an in-memory store (make sure you don't run out of RAM). Ask yourself why you want an in-memory store over normal disk operations when dealing with files. If reads are frequent and access time is crucial, go ahead with Redis.
Regarding size: 100 MB is not a problem. A Redis string can hold up to 512 MB, and lists, sets, and hashes can each hold more than 4 billion elements.
I prefer MongoDB (which is a disk-based document store) over Redis for this kind of operation.
Consider looking at this link to know when redis is awesome.

Speeding up S3 to GCS transfer using GCE and gsutil

I plan on using a GCE cluster and gsutil to transfer ~50 TB of data from Amazon S3 to GCS. So far I have a good way to distribute the load over however many instances I'll have to use, but I'm getting pretty slow transfer rates in comparison to what I achieved with my local cluster. Here are the details of what I'm doing:
Instance type: n1-highcpu-8-d
Image: debian-6-squeeze
typical load average during jobs: 26.43, 23.15, 21.15
average transfer speed on a 70gb test (for a single instance): ~21mbps
average file size: ~300mb
.boto process count: 8
.boto thread count: 10
I'm calling gsutil on around 400 S3 files at a time:
gsutil -m cp -InL manifest.txt gs://my_bucket
I need some advice on how to make this transfer faster on each instance. I'm also not 100% sure whether the n1-highcpu-8-d instance is the best choice. I was thinking of possibly parallelizing the job myself using Python, but I think that tweaking the gsutil settings could yield good results. Any advice is greatly appreciated.
If you're seeing 21Mbps per object and running around 20 objects at a time, you're getting around 420Mbps throughput from one machine. On the other hand, if you're seeing 21Mbps total, that suggests that you're probably getting throttled pretty heavily somewhere along the path.
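To make the difference between the two readings concrete, here is a small back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a sustained rate with no overhead; the function name is made up for the illustration.

```python
def transfer_days(total_bytes, throughput_mbps):
    """Days needed to move total_bytes at a sustained rate given in megabits per second."""
    if throughput_mbps <= 0:
        raise ValueError("throughput must be positive")
    seconds = total_bytes * 8 / (throughput_mbps * 1_000_000)
    return seconds / 86_400

TOTAL_BYTES = 50 * 10**12      # ~50 TB, from the question
per_object_mbps = 21           # measured speed, if it is per object
concurrent_objects = 20        # "around 20 objects at a time"

# Reading 1: 21 Mbps per object, so the instance moves 420 Mbps in aggregate
aggregate_mbps = per_object_mbps * concurrent_objects
days_if_per_object = transfer_days(TOTAL_BYTES, aggregate_mbps)

# Reading 2: 21 Mbps is the total for the whole instance
days_if_total = transfer_days(TOTAL_BYTES, per_object_mbps)
```

Under these assumptions a single instance would need roughly 11 days in the first case and around 220 days in the second, which is why it matters which of the two numbers was measured.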
I'd suggest that you may want to use multiple smaller instances to spread the requests across multiple IP addresses; for example, using 4 n1-standard-2 instances may result in better total throughput than one n1-standard-8. You'll need to split up the files to transfer across the machines in order to do this.
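Splitting the file list across machines can be as simple as dealing the URLs out round-robin. The sketch below is one possible way to do it, not a gsutil feature; the function name and the example URLs are hypothetical.

```python
def shard_manifest(urls, num_instances):
    """Deal the URLs out round-robin so each instance gets a similar number of files."""
    if num_instances < 1:
        raise ValueError("need at least one instance")
    shards = [[] for _ in range(num_instances)]
    for index, url in enumerate(urls):
        shards[index % num_instances].append(url)
    return shards

# Hypothetical list of S3 objects to copy
urls = [f"s3://my_source_bucket/file{i}" for i in range(10)]
shards = shard_manifest(urls, 4)
```

Each shard can then be written to its own file and fed to the existing gsutil command on a separate instance. Since the files average about the same size, dealing them out by count should balance the load reasonably well; if sizes varied a lot, balancing by bytes would be the better choice.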
I'm also wondering, based on your comments, how many streams you're keeping open at once. In most of the tests I've seen, you get diminishing returns from extra threads/streams by the time you've reached 8-16 streams, and often a single stream is at least 60-80% as fast as multiple streams with chunking.
One other thing you may want to investigate is what download/upload speeds you're seeing; copying the data to local disk and then re-uploading it will let you get individual measurements for download and upload speed, and using local disk as a buffer might speed up the entire process if gsutil is blocking reading from one pipe due to waiting for writes to the other one.
One other thing you haven't mentioned is which zone you're running in. I'm presuming you're running in one of the US regions rather than an EU region, and downloading from Amazon's us-east S3 location.
Use the parallel_thread_count and parallel_process_count values in your boto configuration file (usually ~/.boto).
You can get more info on the -m option by typing:
gsutil help options

Node-Local Map reduce job

I am currently attempting to write a map-reduce job where the input data is not in HDFS and cannot be loaded into HDFS basically because the programs using the data cannot use data from HDFS and there is too much to copy it into HDFS, at least 1TB per node.
So I have 4 directories on each of the 4 nodes in my cluster. Ideally I would like my mappers to just receive the paths for these 4 local directories and read them, using something like file:///var/mydata/... and then 1 mapper can work with each directory. i.e. 16 Mappers in total.
However, to be able to do this I need to ensure that I get exactly 4 mappers per node, and exactly the 4 mappers which have been assigned the paths local to that machine. These paths are static and so can be hard-coded into my FileInputFormat and RecordReader, but how do I guarantee that given splits end up on a given node with a known hostname? If the data were in HDFS I could use a variant of FileInputFormat that overrides isSplitable to return false and Hadoop would take care of it, but as all the data is local this causes issues.
Basically all I want is to be able to crawl local directory structures on every node in my cluster exactly once, process a collection of SSTables in these directories and emit rows (on the mapper), and reduce the results (in the reduce step) into HDFS for further bulk processing.
I noticed that InputSplit provides a getLocations method, but I believe that this does not guarantee locality of execution, it only optimises for it. Clearly, if I try to use file:///some_path in each mapper, I need to ensure exact locality, otherwise I may end up reading some directories repeatedly and others not at all.
Any help would be greatly appreciated.
I see there are three ways you can do it.
1.) Simply load the data into HDFS, which you do not want to do. But it is worth trying, as it will be useful for future processing.
2.) You can make use of NLineInputFormat. Create four different files, one per node, each containing the URLs of the input files on that node.
file://192.168.2.3/usr/rags/data/DFile1.xyz
.......
Load these files into HDFS and write your program to read them and access the data through these URLs. If you use NLineInputFormat with 1 line per split, you will get 16 mappers, each processing an exclusive file. The only issue is that there is a high possibility that the data on one node will be processed on another node; however, there will not be any duplicate processing.
3.) You can further optimize the above method by loading the above four files with URLs separately. While loading any of these files you can remove the other three nodes to ensure that the file exactly goes to the node where the data files are locally present. While loading choose the replication as 1 so that the blocks are not replicated. This process will increase the probability of the maps launched processing the local files to a very high degree.
Cheers
Rags
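For approaches 2 and 3, the listing files could be generated rather than written by hand. The sketch below builds one listing per node, one URL per line, using the same file://host/path form as the example above. The host addresses and directory names are hypothetical placeholders, and this only prepares the text that would be loaded into HDFS; it does not touch Hadoop itself.

```python
def build_listings(hosts, directories):
    """Return {host: [url, ...]} with one URL per local data directory on that host."""
    return {
        host: [f"file://{host}{directory}" for directory in directories]
        for host in hosts
    }

# Hypothetical cluster: 4 nodes, each with the same 4 local data directories
hosts = ["192.168.2.1", "192.168.2.2", "192.168.2.3", "192.168.2.4"]
directories = ["/var/mydata/dir1", "/var/mydata/dir2", "/var/mydata/dir3", "/var/mydata/dir4"]

listings = build_listings(hosts, directories)

# With NLineInputFormat at 1 line per split, each line would become one mapper
all_lines = [line for lines in listings.values() for line in lines]
```

Each value in listings corresponds to one of the four files to load into HDFS; the 16 lines in total correspond to the 16 mappers.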

Will doing fork multiple times affect performance?

I need to read log files (.CSV) using FasterCSV and save their contents in a db (each cell value is a record). There are around 20-25 log files which have to be read daily, and those log files are really large (each CSV file is more than 7 MB). I forked the reading process so that the user does not have to wait a long time, but reading 20-25 files of that size still takes time (more than 2 hrs). Now I want to fork the reading of each file, i.e. around 20-25 child processes would be created. My question is: can I do that? If yes, will it affect performance, and is FasterCSV able to handle this?
ex:
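@reports.each do |report|
  pid = fork do
    # read one log file with FasterCSV and save its cells to the db
    .
    .
    .
  end
  # Process.detach reaps the child in the background so it does not become a zombie
  Process.detach(pid)
end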
PS: I'm using Rails 3.0.7, and this is going to run on a server on Amazon's large instance (7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform).
If the storage is all local (and I'm not sure you can really say that if you're in the cloud), then forking isn't likely to provide a speedup, because the slowest part of the operation is going to be disc I/O (unless you're doing serious computation on your data). Hitting the disc from several processes at once isn't going to speed that up, though I suppose if the disc had a big cache it might help a bit.
Also, 7MB of CSV data isn't really that much - you might get a better speedup if you found a quicker way to insert the data. Some databases provide a bulk load function, where you can load in formatted data directly, or you could turn each row into an INSERT and file that straight into the database. I don't know how you're doing it at the moment so these are just guesses.
Of course, having said all that, the only way to be sure is to try it!
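To illustrate the "quicker way to insert" idea, here is a minimal sketch that stores every CSV cell as one record, but sends all of them in a single batched statement inside one transaction instead of issuing one insert per cell. It uses Python with an in-memory SQLite database purely as a stand-in; the table name, column names, and sample data are made up, and the real application would do the equivalent through its own database and driver.

```python
import csv
import io
import sqlite3

def iter_cells(csv_text):
    """Yield (row_index, col_index, value) for every cell in the CSV text."""
    for row_index, row in enumerate(csv.reader(io.StringIO(csv_text))):
        for col_index, value in enumerate(row):
            yield (row_index, col_index, value)

def bulk_insert_cells(conn, csv_text):
    """Insert all cells with one executemany call inside a single transaction."""
    cells = list(iter_cells(csv_text))
    with conn:  # commits once at the end, or rolls back if an insert fails
        conn.executemany(
            "INSERT INTO log_cells (row_index, col_index, value) VALUES (?, ?, ?)",
            cells,
        )
    return len(cells)

# Hypothetical table and a tiny sample log
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_cells (row_index INTEGER, col_index INTEGER, value TEXT)")
inserted = bulk_insert_cells(conn, "a,b\nc,d\n")
```

Whether this helps in the real setup depends on how the rows are being saved today, so it is worth timing both versions on one of the actual log files.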