I want to share large in-memory static data (a RAM Lucene index) across my map tasks in Hadoop. Is there a way for several map/reduce tasks to share the same JVM?
Jobs can enable task JVMs to be reused by setting the job configuration property mapred.job.reuse.jvm.num.tasks. If the value is 1 (the default), JVMs are not reused (i.e. one task per JVM). If it is -1, there is no limit to the number of tasks (of the same job) a JVM can run. You can also specify a value greater than 1 via the API.
In $HADOOP_HOME/conf/mapred-site.xml, add the following property:
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>#</value>
</property>
Replace # with the number of times the JVM should be reused (the default is 1), or with -1 for no limit on the amount of reuse.
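If you would rather set this from code, the classic (pre-YARN) mapred API exposes a setter for the same property. Below is a minimal sketch of combining it with a static holder so that later tasks in a reused JVM get the already-loaded index; SharedIndexHolder and its loader are hypothetical names, not part of Hadoop:

import org.apache.hadoop.mapred.JobConf;

public class IndexSharingJob {
    public static void main(String[] args) {
        JobConf conf = new JobConf(IndexSharingJob.class);
        // Same effect as mapred.job.reuse.jvm.num.tasks; -1 = unlimited reuse
        conf.setNumTasksToExecutePerJvm(-1);
        // ... set mapper class, input/output paths, and submit as usual ...
    }
}

// Hypothetical holder: the first task in a reused JVM pays the load cost,
// and every subsequent task in that JVM gets the same instance for free.
class SharedIndexHolder {
    private static Object index;  // e.g. a Lucene IndexReader over a RAMDirectory

    static synchronized Object get() {
        if (index == null) {
            index = loadIndex();  // expensive load happens once per JVM
        }
        return index;
    }

    private static Object loadIndex() {
        // ... read the index files (e.g. from the DistributedCache) into memory ...
        return new Object();  // placeholder
    }
}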
Shameless plug
I go over using static objects with JVM reuse to accomplish what you describe here:
http://chasebradford.wordpress.com/2011/02/05/distributed-cache-static-objects-and-fast-setup/
Another option, though more complicated, is to use the distributed cache with a read-only memory-mapped file. That way you can share the resource across JVM processes as well.
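A minimal sketch of that approach, assuming the file has already been localized on the node (the path is a placeholder): because the mapping is read-only, the OS page cache backs it with a single physical copy that every process mapping the same file shares.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedIndexDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder path; in practice a file shipped via the DistributedCache
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/index.dat", "r");
             FileChannel ch = raf.getChannel()) {
            // READ_ONLY mapping: pages are shared across JVM processes by the OS
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println("mapped " + buf.capacity() + " bytes");
        }
    }
}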
To the best of my knowledge, there is no easy way for multiple map tasks in Hadoop to share static data structures.
This is actually a known limitation of the current MapReduce model. The reason the current implementation doesn't share static data across map tasks is that Hadoop is designed to be highly reliable: if a task fails, it crashes only its own JVM and does not impact the execution of the other JVMs.
I am currently working on a prototype that distributes the work of a single JVM across multiple cores (essentially you need just one JVM to utilize multiple cores). This way, you can reduce the duplication of in-memory data structures without sacrificing CPU utilization. The next step for me is to develop a version of Hadoop that can run multiple map tasks within one JVM, which is exactly what you are asking for.
There is an interesting post here:
https://issues.apache.org/jira/browse/MAPREDUCE-2123
Related
I noticed that my queries run faster on my local machine than on my server, and on both machines only one CPU core is being used. Is there a way to enable multi-threading so I can use 12 (or even all 24) cores instead of just one?
I didn't find anything in the documentation about setting this up, but I saw that other graph databases do support it. If it is supported by default, what could cause it to use only a single core?
By default GraphDB will use all available CPU cores unless limited by the license type; the Free Edition is limited to 2 concurrent read operations. However, I suspect what you are really asking about is query parallelism (decomposing a query into smaller tasks and executing them in parallel).
Write operations in GraphDB SE/EE are always split into multiple parallel tasks, so they will benefit from multiple cores. GraphDB Free is limited to a single core for commercial reasons.
Read operations are always executed on a single thread, because in the general case queries run faster that way. In some specific scenarios, such as heavy aggregations over large collections, parallelizing query execution could bring substantial benefit, but this is currently not supported.
To sum up, multiple cores will only help you handle more concurrent queries, not process an individual query faster. This design limitation may change in upcoming versions.
We have set up a 3-node performance cluster with 16 GB RAM and 8 cores per node. Our use case is writing 1 million rows to a single table with 101 columns, which currently takes 57-58 minutes. What should our first steps be towards optimizing write performance on this cluster?
The first thing I would do is look at the application that is performing the writes:
What language is the application written in, and which driver is it using? Some drivers offer better inherent performance than others; for example, the Python, Ruby, and Node.js drivers may only make use of one thread, so running multiple instances of your application (one per core) may be something to consider. Your question is tagged 'spark-cassandra-connector', which suggests you are using that; it is built on the DataStax Java driver, which should perform well as a single instance.
Are your writes asynchronous, or are you writing data one row at a time? How many writes are executed concurrently? Too many concurrent writes can put pressure on Cassandra, while too few will reduce throughput (see the sketch of throttled asynchronous writes after this list). If you are using the Spark connector, are you using saveToCassandra/saveAsCassandraTable or something else?
Are you using batching? If so, how many rows are you inserting/updating per batch? Too many rows per batch can put a lot of pressure on Cassandra. Additionally, are all of the inserts/updates within a batch going to the same partition? If they are not, batching them together mostly adds coordination overhead; consider batching only rows that share a partition key.
Spark connector specific: you can tune the write settings, such as batch size, batch level (i.e. by partition or by replica set), write throughput in MB per core, etc. You can see all of these settings here.
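As promised above, here is a rough sketch of throttled asynchronous writes using the DataStax Java driver (3.x-era API). The contact point, keyspace, table, and the in-flight cap of 128 are placeholders to tune for your own cluster, not recommendations:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import java.util.concurrent.Semaphore;

public class ThrottledWriter {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        PreparedStatement ps =
            session.prepare("INSERT INTO my_table (id, payload) VALUES (?, ?)");

        // Cap in-flight async writes: enough to keep the cluster busy,
        // not so many that we overwhelm it.
        final Semaphore inFlight = new Semaphore(128);
        for (long i = 0; i < 1000000L; i++) {
            inFlight.acquire();
            ResultSetFuture f = session.executeAsync(ps.bind(i, "row-" + i));
            Futures.addCallback(f, new FutureCallback<Object>() {
                public void onSuccess(Object result) { inFlight.release(); }
                public void onFailure(Throwable t) { inFlight.release(); /* log or retry */ }
            });
        }
        inFlight.acquire(128);  // drain: wait for the last writes to complete
        session.close();
        cluster.close();
    }
}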
The second thing I would do is look at metrics on the Cassandra side on each individual node:
What do the garbage collection metrics look like? You can enable GC logs by uncommenting lines in conf/cassandra-env.sh (as shown here; see also 'Are Your Garbage Collection Logs Speaking to You?'). You may need to tune your GC settings, although if you are using an 8 GB heap the defaults are usually pretty good.
Do your CPU and disk utilization figures indicate that your systems are under heavy load? Your hardware or configuration could be the constraint; see 'Selecting hardware for enterprise implementations'.
Commands like nodetool cfhistograms and nodetool proxyhistograms will help you understand how long your requests are taking: proxyhistograms shows end-to-end request latencies, while cfhistograms (the latencies in particular) can give you insight into any disparities between how long it takes to process a request versus perform the mutation operations.
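For example (the keyspace and table names are placeholders):

nodetool proxyhistograms
nodetool cfhistograms my_keyspace my_table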
I am learning Apache Helix and came across the keyword 'Partitions'.
According to the definition mentioned here (http://helix.apache.org/Concepts.html), each subtask (of a main task) is referred to as a partition in Helix.
When I went through the Distributed Lock Manager recipe, partitions appeared to be nothing but instances of a resource (increasing numPartitions increases the number of locks).
final int numPartitions = 12;
admin.addResource(clusterName, lockGroupName, numPartitions, "OnlineOffline",
RebalanceMode.FULL_AUTO.toString());
Can someone explain, with a simple example, what exactly a partition in Apache Helix is?
I think you're right that a partition is essentially an instance of a resource. As in other distributed systems, partitions are used to achieve parallelism: a resource with only one instance can run on only one machine. Partitions provide the construct necessary to split a single resource among many machines by, well, partitioning the resource.
This pattern is found in a large portion of distributed systems. The difference, though, is that while, for example, distributed databases define partitions explicitly as subsets of some larger data set that can fit on a single node, Helix is more generic: partitions don't have a single definite meaning or use case, but many potential meanings and potential use cases.
One of these use cases in a system with which I'm very familiar is Apache Kafka's topic partitions. In Kafka, each topic - essentially a distributed log - is broken into a number of partitions. While the topic data can be spread across many nodes in the cluster, each partition is constrained to a single log on a single node. Kafka provides scalability by adding new partitions to new nodes. When messages are produced to a Kafka topic, internally they're hashed to some specific partition on some specific node. When messages are consumed from a topic, the consumer switches between partitions - and thus nodes - as it consumes from the topic.
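As a deliberately simplified sketch of that keying step (Kafka's real default partitioner uses murmur2 hashing rather than String.hashCode, but the modulo idea is the same):

public class PartitionDemo {
    // Map a message key to one of N partitions; the mask keeps the hash non-negative.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always lands on the same partition (and therefore the same node).
        System.out.println(partitionFor("user-42", 12));
        System.out.println(partitionFor("user-42", 12));  // identical output
    }
}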
This pattern applies generally to many scalability problems and is found in almost any HA distributed database (e.g. DynamoDB, Hazelcast), map/reduce system (e.g. Hadoop, Spark), and other data- or task-driven systems.
The LinkedIn blog post about Helix actually gives a bunch of useful examples of the relationships between resources and partitions as well.
I have four child pipelines in my project, and their output is ingested by the main pipeline. I want the output files of the child pipelines to be automatically backed up to some directory after a baseline. This would help if I disable individual forges and want to restore a previous index. Please help.
It should be straightforward to add copy or archive jobs to your baseline update script that execute this sort of thing before your main forge is run.
radimbe, as forge itself is single-threaded (except for the rarely implemented multi-threaded Java manipulator) and monolithic, this sort of architecture is commonly used to take better advantage of multi-processor machines and multi-threaded CPU cores. In addition, if data becomes available at different times or with different frequencies, you can decouple the child forges from your main baseline process, improving its overall turnaround time. And from a strategy point of view, this approach can decompose what might be a large, unwieldy job into simpler, more focused, and more easily maintained components.
How would you assess the difficulty of writing your own process manager based on the Hydra (MPICH) sources, say on a scale of 1 to 100? The change would be to the part responsible for assigning processes to computers.
This shouldn't be too hard, but Hydra already implements several rank allocation strategies, so you might not even need to write any code.
You can already provide a user-specified rank allocation. Based on the provided configuration, Hydra can use the hwloc library to obtain hardware topology information and bind processes to cores.
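For example, recent MPICH builds let you control placement and binding from the command line alone (host names are placeholders, and older Hydra versions used -binding rather than -bind-to):

# hosts file: "hostname:slots" controls how many processes each machine receives
node1:4
node2:4

mpiexec -f hosts -n 8 -bind-to core ./a.out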