Why do we need Hadoop distributions? - apache

I am new to Hadoop, so can anybody please explain why we need Cloudera or Hortonworks? We can download each Apache project and use those libraries to create a Big Data project, right? Also, if I already use a Linux OS, do I have to use the Cloudera QuickStart VMware image? Thanks in advance.

Let's look at this using an analogy.
Assume you are running OS 'D' at version 'v1', and on it you need three libraries: A, B, and C.
A depends on B, and C also depends on B, but different versions of A and C depend on different versions of B.
Now, if you need all three libraries, it becomes your headache to make sure you install versions of each that are mutually compatible, with no clashes.
Moreover, not everyone is an expert in all three libraries, let alone the underlying system. What happens if some optimization is needed when using these libraries in your own tools? And who helps with the issues you face while using them?
That's where these "stack distributions" come into play. Each of these vendors provides a complete stack that is tested as a whole, so all of the packaged components (not just Hadoop itself) are known to be compatible with one another. This makes life much easier for a lot of people. Depending on your plan or subscription with the vendor, you can also get support, training, and other auxiliary services.
As an extra note, please remember that open source does not necessarily mean free. (This is my personal opinion.)
As to the other part of your question, whether you need a VM image when you already run Linux: no, but for a beginner, or for learning purposes, it makes your life rather simpler.

Related

Migrating to Xi52

We are planning to migrate our codebase from the XI50 to the XI52. Could anyone please let me know how the XI52 differs from the XI50? I am just trying to figure out what kind of changes will need to be made to our existing XI50 codebase to make it compatible with the XI52.
Also, I have two questions:
1) Is the XI52 the best hardware to migrate to from the XI50? What are its advantages over the alternatives?
2) What are the best practices for migrating the configuration from the XI50 to the XI52?
Regards,
Rahul
Good question, Rahul!
As a rule of thumb, the compatibility of the codebase is essentially a matter of the firmware it runs, not of the model itself.
So, find out the firmware version of your XI50 and your target firmware version, which is probably the latest. If the target device runs the same firmware version (such as version 5), there should be no issues.
Here is a list of all firmware versions:
http://www-01.ibm.com/support/docview.wss?uid=swg21237631
The release notes are where to look for information on each firmware version, so go through the list (you may have to aggregate changes across several versions, such as 4.x to 5.x and 5.x to 6.x).
Sometimes you have to dig into individual technotes for the details.
In general, DataPower maintains compatibility rather well, and breaking changes reside mainly in the details.
http://pic.dhe.ibm.com/infocenter/wsdatap/v5r0m0/index.jsp?topic=%2Fcom.ibm.dp.xi.doc%2FrelnotesXI.html
http://pic.dhe.ibm.com/infocenter/wsdatap/v6r0m0/index.jsp?topic=%2Fcom.ibm.dp.xi.doc%2FrelnotesXI.html

choosing versioning software

I work on Windows and I need a very simple version-tracking tool that can check a project folder in and out, no matter what's inside. I downloaded a few programs, but most of them are very complicated: team features, cloud support, thousands of options, etc.
I just need simple version tracking of my files, locally. Can you recommend something useful?
I'd recommend simply using rar with a datetime stamp. Or, as an option, just pass the parameter that updates an existing archive instead of creating a new one.
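If you would rather script that idea than run it by hand, here is a minimal sketch in Python using only the standard library (the project and backup paths are made-up placeholders, and zip stands in for rar):

```python
import datetime
import shutil

# Hypothetical paths -- adjust to your own layout.
PROJECT_DIR = r"C:\work\myproject"
BACKUP_BASE = r"C:\backups\myproject"

def snapshot():
    # Produces e.g. C:\backups\myproject-20140101-153000.zip
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = shutil.make_archive("%s-%s" % (BACKUP_BASE, stamp),
                                  "zip", PROJECT_DIR)
    print("created %s" % archive)

if __name__ == "__main__":
    snapshot()
```

Each run produces a new timestamped archive, which is crude but gives exactly the local folder-level "check-in" behaviour you asked for.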
There are reasons why version control tools have as many options as they do; without understanding the basics of how the particular system works, those options can seem overwhelming. To use version control at all, you will have to put in a little effort to understand how it works. That said, I find that Bazaar from Canonical makes a pretty good introduction to version control for beginners: it has a nice download page for various platforms, ships with a GUI client, and comes with beginner-friendly documentation.
However, having used other version control systems, I personally don't like using Bazaar. The choice of system should not make much difference if you only intend to use it yourself and don't need any of the more advanced features. If you are willing to invest some more time, though, I would recommend trying Mercurial: it has documentation aimed at beginners and a fairly nice beginner-friendly GUI for Windows in the form of EasyMercurial.

D Development Process

What is the recommended development process for D programs that use packages cloned from GitHub and built separately?
I am asking in relation to how C/C++ projects are typically built using make, autotools, cmake, etc.
Most other build specifications have an install target. Should there be an install target in the build, or should we just link the library directly from where it lands when built, register its imports in D_INCLUDE_PATH, and point the compiler at them with DFLAGS=-I<D_INCLUDE_PATH>?
I realise my comment can actually be an answer to the question, so here it is:
The D development process need not be any different from its counterpart in the C or C++ world. Almost all C and C++ compilers generate native code, and D is no exception. There was the D.NET project that could target .NET, but it has been inactive for years...
Furthermore, all the tools used in C/C++ projects can easily be used for anything else: CMake can be used in Java or .NET projects as well, and the same goes for Make and autotools. Why Maven and Ant are more popular in the Java world is a different story.
Speaking of which, you can use Maven or Ant in the D development process! Admittedly, you would need to write your own Maven plugins to make it easier and more flexible, but it is doable, and it would in fact make a very nice project.
From what I have seen, D programmers stick to good old Make, or write a Bash script to do the whole thing. However, I've seen people from the Lycus Foundation use Waf. If you are a Python programmer, you will just LOVE Waf. If not, try similar tools - I've seen people use SCons, Remake, Premake, etc...
DSSS+Rebuild is the closest thing to a really useful tool of this kind written in D itself. Unfortunately, they are dead projects. :(
I am working on a Maven-style tool, but considering the amount of time I have, it won't be usable before 2014. :)
I would go with SCons, which has support for D thanks to Russel Winder:
http://scons.tigris.org/ds/viewMessage.do?dsForumId=1268&dsMessageId=2959039
If not, then POM (plain old make).
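As an illustration, a minimal SConstruct for a D program might look like the sketch below. The file layout is hypothetical, and this assumes an SCons installation whose D ('dmd') tool is available, as discussed in the thread linked above; check your SCons version's documentation for the exact variable names.

```python
# SConstruct -- SCons build files are plain Python.
# Hypothetical layout: sources in src/, a cloned dependency in deps/.
env = Environment(tools=['default', 'dmd'])

# Search path for D imports, analogous to DFLAGS=-I<D_INCLUDE_PATH>
# in the question (DPATH is the D tool's import-path variable).
env.Append(DPATH=['deps/somelib/source'])

env.Program(target='myapp', source=['src/main.d'])
```

Running 'scons' in the project root would then compile src/main.d with dmd and link the program, with no separate install step required.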

objective-c frameworks - Dynamic Library Install Name

I'm new to Objective-C and the OS X architecture. I started playing with building a framework and then consuming it, following this great tutorial.
During the tutorial, I had to set the framework target's Dynamic Library Install Name to @rpath/MyFramework.framework/Versions/A/MyFramework. My understanding is that @rpath will expand to the loader's (consumer's) run-path search paths.
It seems as if the responsibility for loading the framework is split between the framework author and the consumer author. Could someone please explain why the author of the framework needs to be concerned with the consumer's run-path search paths? For example, if the framework author set the Dynamic Library Install Name to point to some random directory (instead of @rpath), how would the client be able to consume the framework?
Thanks in advance.
It depends a lot on how the framework is being used, and it's important to remember that the framework construct has existed on this platform for a long time.
For a system framework, such as the ones Apple creates, you're going to be quite happy that they keep the frameworks in a known location. In those cases the paths they use are fixed for the OS, which guarantees that you don't accidentally load the wrong one. Further, as indicated in the framework documentation, these frameworks are loaded only once per machine, regardless of how many processes use them (see Apple: What Are Frameworks). The benefit here is performance, and in many cases it applies to both the code and the resources.
Even with the recent move to randomize framework locations - Apple's release notes state that "Mountain Lion randomly relocates the kernel, kexts, and system frameworks at system boot" - it certainly appears they're still sharing these resources, and thus still gaining this benefit.
For embedded frameworks the situation is a lot more tedious, and Apple has moved through a variety of methods over the years to make it easier to find frameworks wherever they may be. Again because of their shared nature, it makes sense for applications with common library requirements to share one copy on the machine, both for efficiency and to make sure they're on the same version if they're sharing data. So, for example, if you have two separate apps that use the same framework to work with shared data, you might put that framework in /Library/Frameworks and have both apps explicitly look there, making sure that some other (possibly older) copy of the framework, loaded by another app, is not used instead.
In the end, the way it currently works gives a lot of flexibility to both the framework producer and the consumer. A developer can decide to share a framework, include a private copy, or even do both, depending on whether the framework already exists on the machine. The price of that flexibility, however, is the complexity we have today.
Another example of a reason you might not want to use @rpath specifically is tightly coupled embedded frameworks (yes, people embed frameworks within other frameworks). In these cases you don't know where the outer framework will be loaded from, but you want the embedded framework to stay inside it so that they travel together. Here @loader_path is resolved relative to the code that is doing the loading, so your plug-in's framework can find its resources correctly.
As for your specific example of somebody setting the Dynamic Library Install Name to a "random" location: in that case, you would simply have to know that location. There might be many reasons for doing this, such as wanting to discourage reuse by other programs, or because the framework contains large resources that should only be installed in a single known, shared location.

Most appropriate platform independent development language

A project is looming whereby some code that I will be writing may be deployed on whatever hardware potential clients happen to have. It's a business application that will be running 24/7, so I envisage that most of the host machines will be server-class boxes, but smaller clients might, for example, just have a simple PC.
A few more details about the code I will be writing:
There will be no GUI.
It will need to communicate with another bespoke 'black box' device over an Ethernet network.
It will need to communicate with a MySQL database somewhere on the network.
I don't have any performance concerns because: a) the number of communications with the black box will be small, around one per second, and the amount of data exchanged will be tiny (around 1K each time); b) the number of reads/writes against the database will be small, around five per minute, and again the amount of data exchanged will be tiny; and c) the processing that needs to be performed is fairly simple.
Nothing I'm doing is very 'close to the metal', so I don't want to use a language that is too low-level. Ease of development and ease of deployment are my main priorities.
I'm not expecting a perfect solution, so I can live with things like slightly different configuration files for Windows machines than for Linux boxes. I would, however, like to avoid having to compile the software separately for each host machine if possible.
I would value your thoughts as to which development language you think is most suitable.
Cheers,
Jim
I'd personally go with a decent scripting language such as Python, Perl, or Ruby. All of them have decent library support, can communicate easily with both local and remote MySQL databases, and are pretty platform-independent.
The first thing we need to know is: what language skills do you already have? That is likely to be a fairly big determiner of which choice would be ideal for you.
If I were doing this, I'd suggest Java, for a few reasons:
It will run almost anywhere and meets the requirements you've outlined.
It's not an esoteric language, so there will be plenty of developers.
I already know how to program in it!
It probably has the most extensive library ecosystem of any development platform.
Also note that you could write the code in another JVM language if you're more comfortable with Ruby or Python.
Sounds like Perl or Python would fit the bill perfectly. Which one you choose should depend on the expertise of the people building and supporting the system.
On the subject of scripting languages versus Java: I have been disappointed developing command-line tools in Java. You can't execute them directly; you have to (1) compile them and (2) write a shell script to execute the JAR file, and that script may differ between platforms. I recommend Python because it runs anywhere and has a great SQL library, mysql-python, which is ready to use on both Windows and Linux. Python also has far less boilerplate; you'll write fewer lines of code to do the same thing.
EDIT: when I talked about JARs being executable or not, I meant directly executable by the OS. You can, of course, double-click one to run it if your file manager is set up for that. But in a terminal window, to run a Java program you have to type "java -jar myapp.jar" instead of the usual "./myapp.jar". In Python one just runs "./myapp.py" and never worries about compiling or class paths.
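To illustrate the mysql-python (MySQLdb) usage mentioned above, here is a minimal sketch; the host, credentials, and table names are made-up placeholders, and the module's Python 2-era API is assumed:

```python
#!/usr/bin/env python
# Minimal mysql-python sketch -- all connection details are hypothetical.
import MySQLdb

conn = MySQLdb.connect(host="db.example.com", user="appuser",
                       passwd="secret", db="telemetry")
try:
    cur = conn.cursor()
    # Parameterized insert, e.g. one reading from the 'black box' device.
    cur.execute("INSERT INTO readings (device_id, payload) VALUES (%s, %s)",
                ("box-01", "status=ok"))
    conn.commit()

    cur.execute("SELECT COUNT(*) FROM readings")
    print("rows so far: %d" % cur.fetchone()[0])
finally:
    conn.close()
```

With the shebang line and the executable bit set, this runs as "./myapp.py" on Linux, exactly as described in the edit above.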
If all the platforms are standard PCs (or at least run Linux), then Python should be considered. You can compile it yourself if no package exists for your version. Also, you can easily strip from the standard library anything that isn't available or that you don't need (sound support, for example).
Python doesn't need many resources, and it's easy to learn and to read.
If you know Perl, you can try that. If you don't use Perl on a daily basis, though, don't: Perl syntax is hard to remember, and after a week you'll wonder what the code does, even if you wrote it yourself.
Perl may be of help to you as it is available for many platforms and you can get almost any functionality by simply installing modules from CPAN.
Python or Java. Both are easy to deploy on the server environments and the desktop environments you mention, i.e., Linux/Solaris and Windows.
Perl is also a nice choice, but it depends on how well you know Perl, how well the other people who will maintain your code know it, and how many desktop users are savvy enough to handle installing the Windows Perl version(s).
Since Java supports Python via Jython, I'd go with a JVM requirement myself, but personally I'd build a system like the one you describe as a Java application all the way.
I would say use C or C++. They are platform-independent, though you will have to compile for each platform.
Or use Java. It runs in a virtual machine, so it is truly cross-platform, and it is not as low-level as C.