Hadoop release versions are confusing - apache

I am trying to figure out the different versions of Hadoop, and I got confused after reading this page.
Download
1.2.X - current stable version, 1.2 release
2.2.X - current stable 2.x version
2.3.X - current 2.x version
0.23.X - similar to 2.X.X but missing NN HA.
Releases may be downloaded from Apache mirrors.
Question:
I think any release starting with 0.xx is an alpha version and should not be used in production. Is that the case?
What is the difference between 0.23.X and 2.3.X? The page says they are similar, but that 0.23.X is missing NameNode high availability (NN HA). Is there any correlation between 0.23 and 2.3? Was it that, while developing the code, the PMC said "man, this is so immature that it should start with 0, but since it is the same product, we will keep the digits the same"?
When I look at the source code of the new Hadoop, I see that the JobTracker class has turned into a dummy class. I expect that the JobTracker and TaskTracker, i.e. MapReduce 1, will slowly fade away from the Hadoop roadmap. In other words, the interface for MapReduce jobs might stay the same, but the second generation of Hadoop (YARN) will completely replace the JobTracker and TaskTracker with the ResourceManager, etc.
Sorry that this question is a bit unorganized; I got really confused by the version numbers. I will revise the question once I have figured it out.

First of all: there's a major difference between Hadoop v1 and v2 (aka YARN). In v2, v1's JobTracker and TaskTracker are replaced by the new ResourceManager and NodeManager for better scalability. That's why both will disappear later on in the development.
Second: a 0.X version number is by no means a hint of an alpha release. OpenSSL stayed on 0.9.x releases for over ten years (en.wikipedia.org/wiki/OpenSSL#Major_version_releases) even though it was considered a de facto standard or reference implementation, and many Fortune 500 companies trusted it.
And that's true for Hadoop as well. The 0.23 version refers to Hadoop v1's architecture with v2 implementations (except High Availability, as the NameNode is still v1's). So 0.23 and 2.3 are about the same and continue to age in parallel. They named it 0.X because 1.X was already in use. They just didn't want 1.X to keep aging, to indicate that 2.X is the way to go. You may use 0.X only if you rely on 1.X's architecture but still want to receive minor improvements from the current development in 2.X.
The bottom part of this page tries to explain this, but is a bit helter-skelter as well: http://wiki.apache.org/hadoop/Roadmap. The top part of this page does it a bit better: http://hadoop.apache.org/releases.html
Hope this was helpful...

From the image below you can see that Hadoop 2.6.2 was released after 2.7.1.
Reasoning
2.6 to 2.6.2 is a maintenance update within the 2.6 line and IS backward compatible.
2.6 to 2.7 is a larger API update, i.e. it is NOT guaranteed to be backward compatible. Some APIs may now be obsolete.
Ref Hadoop Road map
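The point that release order and version order are two different things can be sketched in a few lines of Python. This is only an illustration, not anything taken from the Hadoop project: the helper names `parse_version` and `release_line` are hypothetical, and the short release list is only meant to mirror the observation above that 2.6.2 came out after 2.7.1.

```python
def parse_version(text):
    """Turn '2.6.2' into (2, 6, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def release_line(text):
    """Return the 'major.minor' line a version belongs to, e.g. '2.6' for '2.6.2'."""
    major, minor = parse_version(text)[:2]
    return f"{major}.{minor}"


# Illustrative chronological order (assumption): 2.6.2 appears after 2.7.1.
released_in_order = ["2.6.0", "2.7.1", "2.6.2"]

# Sorting by version number gives a different order than the release order.
by_version = sorted(released_in_order, key=parse_version)

# Grouping by release line shows two lines being maintained side by side.
lines = {}
for version in released_in_order:
    lines.setdefault(release_line(version), []).append(version)

print(by_version)
print(lines)
```

Comparing tuples rather than strings also avoids mixing up 0.23 and 2.3: `parse_version("0.23.10")` sorts before `parse_version("2.3.0")`.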

Related

Why does TensorFlow have multiple versions released at the same time, such as 2.5.1, 2.4.3, and 2.3.4?

I noticed that each time TensorFlow publishes a release (https://github.com/tensorflow/tensorflow/releases), several different versions are released at the same time.
What are the differences between the versions?
Software, especially open source software, usually has multiple release lines maintained at the same time (Long Term Support, Stable, WHQL, etc.). Depending on how many people use a specific version, the maintainers may also keep that version updated for longer (either because breaking changes prevent users from upgrading, or for other reasons).
Whenever there is a critical or security bugfix, all the branches receive that bugfix if possible.
This is what happened with TensorFlow: the latest releases on those 3 versions were all security/vulnerability bugfixes.

Optimizing a neural net for running in an embedded system

I am running some code on an embedded system with extremely limited memory, and even more limited processing power.
I am using TensorFlow for this implementation.
I have never had to work in this kind of environment before.
What are some steps I can take to ensure I am being as efficient as possible in my implementation and optimization?
Some ideas -
- Pruning code -
https://jacobgil.github.io/deeplearning/pruning-deep-learning
- Ensure loops are as minimal as possible (in the big O sense)
- ...
Thanks a lot.
I suggest using TensorFlow Lite.
It will enable you to compress and quantize your model to make it smaller and faster to run.
It also supports using a GPU and/or a hardware accelerator if either is available to you.
https://www.tensorflow.org/lite
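To give a feel for why quantization makes a model smaller, here is a minimal pure-Python sketch of symmetric 8-bit quantization of a list of weights. It is not TensorFlow Lite's implementation or API; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical and only illustrate the general idea of storing one small integer per weight plus a single shared scale.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each weight now needs 1 byte instead of 4 (float32), at the cost of a
# rounding error of at most half a quantization step per weight.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

The trade-off shown here is the one to watch on an embedded target: roughly a 4x reduction in weight storage against a small, bounded loss of precision.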
If you are working with TensorFlow 1.13 (the latest stable version before the 2.0 prototype), there is a pruning function in the tf.contrib submodule. It has a sparsity parameter that you can tune to determine the size of the network.
I suggest you take a look at the whole tf.contrib.model_pruning submodule. It has plenty of functions you might need for your specific task.
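As a rough illustration of what a sparsity parameter controls, the sketch below zeroes out the smallest-magnitude weights until a target fraction of them is zero. This is a generic magnitude-pruning example in plain Python, not the tf.contrib.model_pruning API; `prune_by_magnitude` and the sample weights are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero the smallest-magnitude weights so that `sparsity` of them become 0."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_to_drop = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_to_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]  # hypothetical example weights
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)
```

A higher sparsity value leaves fewer non-zero weights, which is what lets a sparse storage format or a smaller network save memory afterwards.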

Running gradient-free optimization methods in parallel with OpenMDAO and PyOptSparse

I would like to run ALPSO and NSGA2 from OpenMDAO using the PyOptSparse driver in parallel. The catch is that I don't want to run the model itself in parallel (which I have done frequently in OpenMDAO), I just want to run the optimization computations in parallel (e.g. distribute the calculations for swarm members of ALPSO). I have been looking through the documentation and source for all of the above mentioned codes, but I have not found a way to do this. Could someone point me in the right direction?
Note: I am currently using OpenMDAO 1.7.3, but I am open to answers involving later versions.
I don't believe that those optimizers support parallel execution. It would most likely require modifications to the code in ALPSO/NSGA2, pyoptsparse, and the pyoptsparse driver to support this.
In OpenMDAO 2.2 (the latest version), we do have a simple GA driver that can run the evaluation of points in the population in parallel, so maybe that is an option. (It is pretty simple though, and only supports a single objective.)
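The general idea of evaluating the members of a population in parallel, independent of any particular driver, can be sketched with Python's standard library. This is not OpenMDAO or pyOptSparse code; the objective `sphere`, the helper `evaluate_population`, and the sample population are hypothetical stand-ins for a real model evaluation.

```python
from concurrent.futures import ThreadPoolExecutor


def sphere(x):
    """Toy single-objective function standing in for an expensive model run."""
    return sum(v * v for v in x)


def evaluate_population(objective, population, max_workers=4):
    """Evaluate every design point concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(objective, population))


population = [[0.0, 0.0], [1.0, 2.0], [-3.0, 0.5]]  # hypothetical swarm/population
scores = evaluate_population(sphere, population)
best = population[scores.index(min(scores))]
print(scores, best)
```

Threads only pay off when each evaluation spends its time outside the Python interpreter (for example in an external solver). For CPU-bound pure-Python objectives, separate processes or MPI ranks would be the more likely choice.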

Building new TensorFlow Op, is there a build system standard?

After watching this question I decided to give writing a new op for TensorFlow a try.
Since the required tools (C++, Python, and likely a *nix system) are not my primary tools, I would like to avoid reaching a point where I have to back out and change systems or tools just because I did not ask.
Is there a standard or preferred system and/or set of tools used by those working on TensorFlow?
I know that recommendation questions are not allowed here; I am not asking for a personal recommendation, I am asking for the standard used by or what the TensorFlow group finds that works.
Really, anything where you can get Bazel and the required libraries up and running. But since you're starting from scratch: Ubuntu is a very safe bet and (I haven't measured this, but this is a solid estimate) probably gets the most testing and development by the TensorFlow team. There are many options that all work, though: you can develop inside a virtualenv on many environments. Things like GPU support get a little more platform-specific, and that's where Ubuntu starts to become the easiest choice if you don't have any other constraints.
The key requirements are outlined in installing Tensorflow from sources.

IntelliJ memory issues

I have been using IntelliJ IDEA 12 for developing Java applications. It is the best experience I have had with an IDE, and it was working fine until recently. It has started to show a heap size warning that recommends increasing Xmx and asks me to either ignore it or shut down. The behaviour is weird: the IDE starts at 300 MB, then takes more and more memory until it reaches 750+ MB, which is when it shows the warning.
I switched back to Eclipse, and its memory footprint is stable at 300 MB and doesn't increase over time like IntelliJ's does.
Is IntelliJ running some background process related to my code that causes this increase? Or is it a memory leak in the IDE?
I've used IDEA for 10 years (and used IDEA 12 for a year before switching to IDEA 13 EAP builds) and have never had a memory issue. And I do not see any consistent mention of memory issues in the IDEA forums.
That said, a memory leak was just fixed (as in released today) in the IDEA 13 EAP. The VcsLogGraphTable class had a leak. The ticket does not give any indication if the leak was/is present in IDEA 12. Based on the name of the class, it should only come into play for Git or Hg graphs (but Hg graphs were added in 13). Based on my experience with how they do tickets, I interpret this as an IDEA 13 issue.
First, make sure you are using the latest version, 12.1.6.
Oftentimes memory issues are the result of a poorly written third-party plug-in. You can try disabling any third-party plug-ins and see if the issue is resolved.
The other thing you can do is follow the instructions in the document How to report IntelliJ IDEA performance problems and take CPU snapshots and report the issue to JetBrains. That way they can confirm a leak in IDEA 12, or tell you what plug-in is the culprit.