Can I use a different Protocol Buffers version for Pig jobs than the one the underlying Hadoop infrastructure uses? - apache-pig

The Hadoop cluster I work on ships with a protocol buffers jar for its own use. I would like to write Pig scripts whose UDFs store data using my own version of protocol buffers. Is it possible for my UDFs to use a different version of protocol buffers than the underlying Hadoop system does?
For context, compiled protocol buffer code is not interchangeable across versions, even though the wire format is. So I need assurance that supplying my own jar will not replace the standard version Hadoop relies on for its internal work. I have also confirmed that the standard protocol buffers version supplied on my cluster is incompatible with the compiled message classes I supply.
At issue is whether the protocol buffers library versions must stay 100% in sync between the cluster and my library code at all times. That would introduce tight coupling and future maintenance headaches. I am likely to switch to Thrift, which apparently has no pre-existing dependencies on the cluster.
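For what it's worth, one common way to sidestep this kind of clash is to shade (relocate) the protobuf classes inside the UDF jar at build time (e.g., with the maven-shade-plugin), so your copy never collides with the cluster's. Below is a minimal sketch of a Pig EvalFunc that serializes with its own bundled protobuf runtime; MyMessage and its value field are hypothetical stand-ins for a protoc-generated class.

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataByteArray;
    import org.apache.pig.data.Tuple;

    // Hypothetical protoc-generated class; after shading, its protobuf
    // runtime lives under a relocated package, so it cannot collide with
    // the protobuf jar the Hadoop cluster ships.
    import my.udfs.proto.MyMessage;

    public class ToProtoBytes extends EvalFunc<DataByteArray> {
        @Override
        public DataByteArray exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            MyMessage msg = MyMessage.newBuilder()
                    .setValue(input.get(0).toString()) // assumes a string field "value"
                    .build();
            // The wire format is stable across protobuf versions, so the
            // bytes written here remain readable by other runtimes.
            return new DataByteArray(msg.toByteArray());
        }
    }

The Pig script would then just REGISTER the shaded jar; Hadoop's own protobuf copy is never touched, because the relocated classes have different fully qualified names.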

Related

Can boringssl work in bare metal ARM system?

Can BoringSSL work on an ARMv8 bare-metal platform? I tried to build it with aarch64-elf-gcc, but it refused to build.
If it can, are there any porting guides or suggestions?
Probably not out of the box. But you should probably not even try using it, mainly because, according to Google itself, it is not intended for general use.
It is never good to be on your own when using a library, especially a cryptographic one. That usually means no bug fixes, no support, and no user forums, among other things.
Consider instead a library designed for this purpose, such as mbedtls (formerly known as PolarSSL).
It is used on a wide range of systems, from bare-metal systems (FreeRTOS) to Linux (the Hiawatha web server uses it, for example).
Update: even if you need support for the Armv8-A hardware crypto extensions, you could still reuse BoringSSL's Armv8-A optimized routines (ISC license) or the Cavium armv8_crypto library (BSD license) to replace the equivalent mbedtls (Apache 2.0 license) routines: cryptographic functions usually have clean, small interfaces.
In my experience, this can still be faster than porting a library that targets a general-purpose operating system to a bare-metal target, but you ultimately have to evaluate the costs of both options in your specific case.
My guess is that adding support for the Armv8-A crypto extensions to mbedtls using existing, supported code under the proper license involves far less work than attempting to strip down OpenSSL or BoringSSL for a bare-metal target.
There is a very good piece of documentation explaining how to add hardware-accelerated crypto support to mbedtls here; it may help you evaluate your options.

How to monitor JVM without installing JDK

I want to monitor JVM performance in my production environment. I have installed only the JRE, not the JDK, so I can't use jstat, jconsole, etc. to monitor JVM performance.
Can somebody please help me understand how I can monitor JVM performance in this scenario?
Is there any way to achieve this?
(Please note that I don't want to monitor remotely through JMX or anything similar. I would like to install a local agent on each machine that sends the metrics to a server at one-minute intervals.)
Thanks,
KS
If you manage to get JMX up and running on your VM (from the comment), you can then use jmxterm or jmxfetch to push those JMX metrics into a metrics system (like Graphite or Datadog).
If you have enough patience and time to write it, you could have a look at JVMTI. You write your code in C/C++ and run it alongside your Java process, gathering information about the JVM without affecting it.
Another simple and naive way is to start your VM with a javaagent written in Java, but JVMTI is more powerful. The most crucial difference between a javaagent and a JVMTI agent comes from the loading mechanics: javaagents are loaded inside the heap and governed by the same JVM, whereas JVMTI agents are not bound by the JVM's rules and are thus not affected by JVM internals such as GC or runtime error handling.
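For the javaagent route, here is a rough sketch, assuming you can add -javaagent:metrics-agent.jar to the JVM's startup flags; it samples heap usage once a minute, and the println is a stand-in for whatever sender ships the numbers to your server.

    import java.lang.instrument.Instrumentation;

    // Minimal -javaagent sketch. The jar's manifest must declare
    // Premain-Class: MetricsAgent for the JVM to call premain().
    public class MetricsAgent {
        public static void premain(String args, Instrumentation inst) {
            Thread sampler = new Thread(() -> {
                while (true) {
                    Runtime rt = Runtime.getRuntime();
                    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                    System.out.println("heap.used.mb=" + usedMb); // replace with your sender
                    try {
                        Thread.sleep(60_000); // one-minute interval, as the OP wants
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            });
            sampler.setDaemon(true);
            sampler.start();
        }
    }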
You can even give Java Mission Control a try if you're using JDK7 or above :)
Jolokia is a Java agent you can use to expose JMX over HTTP. Run jmx2graphite to get those metrics into Graphite; the link includes Graphite installation instructions (10 minutes).

Role of the JVM

Would the JVM (and probably also the CLI) be considered a virtual machine (the equivalent of the x86 in a "normal" program stack) or a virtual OS (the equivalent of Windows)?
Strictly speaking, it is a virtual machine, i.e., it executes a special low-level language (similar to x86 assembly; the CLI uses MSIL, the JVM uses "byte codes") and translates it into the target machine's op-codes (x86, x86_64, ARM, etc.) for execution on the host CPU.
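To make the "byte codes" point concrete, here is a trivial Java method with (in comments) roughly what javac compiles it to; the JVM's JIT then turns those stack-machine instructions into native op-codes for whatever CPU it is running on.

    public class Add {
        // javac compiles this to stack-machine bytecode, roughly:
        //   iload_0   // push first argument
        //   iload_1   // push second argument
        //   iadd      // pop both, push their sum
        //   ireturn   // return top of stack
        static int add(int a, int b) {
            return a + b;
        }
    }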
It also manages marshaling (i.e., the correct handling and passing of variables to the native memory stack/heap) to allow function calls from inside the managed world to the outside OS on which the VM runs.
Practically, though, neither the JVM nor the CLI alone is very helpful beyond automated garbage collection and CPU-architecture independence; they are complemented by a large base library (the Java classes, or the .NET BCL) that lets you do many platform-y things without having to call platform-specific APIs and marshal manually for everything.
That is why there is a different Java Runtime Environment for each OS. Each one's JVM translates to a specific CPU architecture and uses different platform-specific APIs to accomplish what the unified base library exposes to you as a friendly API inside the managed world.
Hope that helps you.
The JVM is considered a real computer, only not realized in hardware. The machine has its own storage capacity, its own memory model, its own specific central-processing-unit behaviour, and its own internal machine code. The machine is extensible with new capabilities and modules, represented as classes, APIs, etc.
It has its own stack-based architecture, like most virtual machines.

Protocol Buffers and Hadoop

I am new to the Hadoop world. I know that Hadoop has its own serialization mechanism called Writables, and that Avro is another such library. I wanted to know whether we can write map-reduce jobs using Google's protocol buffers serialization. If yes, can someone point me to a good example to get started?
Twitter has published its elephant-bird library, which allows Hadoop to work with protocol buffers files.
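If you would rather not pull in elephant-bird, a bare-bones alternative is to carry the serialized message bytes in a BytesWritable and parse them in the mapper. A sketch, where MyRecord and its user_id field are hypothetical stand-ins for your protoc-generated class:

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical protoc-generated message class.
    import my.proto.MyRecord;

    public class ProtoMapper
            extends Mapper<LongWritable, BytesWritable, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            // BytesWritable pads its backing array, so copy only the valid bytes.
            byte[] bytes = new byte[value.getLength()];
            System.arraycopy(value.getBytes(), 0, bytes, 0, value.getLength());
            MyRecord record = MyRecord.parseFrom(bytes); // protobuf deserialization
            context.write(new Text(record.getUserId()), ONE);
        }
    }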

Is it possible to have inter-process communication in Java?

I have two Java programs, each running in its own JVM instance. Can they communicate with each other using an IPC technique like shared memory or pipes? Is there a way to do it?
Yes; D-BUS and Pipes are both easy to use, and cross-platform. D-BUS is useful for general message-passing IPC, and pipes for sending bulk data.
You can also open a TCP or UDP socket on localhost, if you need to support multiple clients connecting to a central server.
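A minimal sketch of the localhost-socket approach (the port number is an arbitrary choice): one JVM listens, the other connects and sends a line.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class LoopbackIpc {
        // Run in JVM 1: listen for a connection on the loopback interface.
        static void server() throws Exception {
            try (ServerSocket server = new ServerSocket(9999);
                 Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                System.out.println("received: " + in.readLine());
            }
        }

        // Run in JVM 2: connect and send one message.
        static void client() throws Exception {
            try (Socket socket = new Socket("localhost", 9999);
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                out.println("hello from the other JVM");
            }
        }
    }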
I also found an implementation of UNIX sockets in Java, though it requires JNI.
http://java.sun.com/javase/technologies/core/basic/rmi/index.jsp
Java Remote Method Invocation (Java RMI) enables the programmer to create distributed Java-to-Java applications, in which the methods of remote Java objects can be invoked from other Java virtual machines, possibly on different hosts. RMI uses object serialization to marshal and unmarshal parameters and does not truncate types, supporting true object-oriented polymorphism.
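As a rough sketch of what RMI looks like in practice (the names and port are placeholders): one JVM exports an object, the other looks it up and calls it as if it were local.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // Remote interface shared by both JVMs.
    interface Greeter extends Remote {
        String greet(String name) throws RemoteException;
    }

    public class RmiServer implements Greeter {
        @Override
        public String greet(String name) {
            return "hello, " + name;
        }

        public static void main(String[] args) throws Exception {
            // JVM 1: export the object and bind it under a name.
            Registry registry = LocateRegistry.createRegistry(1099);
            Greeter stub = (Greeter) UnicastRemoteObject.exportObject(new RmiServer(), 0);
            registry.rebind("greeter", stub);

            // JVM 2 would then call:
            //   Greeter g = (Greeter) LocateRegistry
            //           .getRegistry("localhost", 1099).lookup("greeter");
            //   System.out.println(g.greet("world"));
        }
    }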
Sure. Have a look at RMI or a shared-memory concept like JavaSpaces.
Use a MappedByteBuffer (obtained via FileChannel.map) from Java NIO to share memory between processes.
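A sketch of that shared-memory route (the file path and size are placeholders): both JVMs map the same backing file, so a write by one process is visible to the other.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class SharedMemoryIpc {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("/tmp/ipc-buffer", "rw");
                 FileChannel channel = file.getChannel()) {
                // Both processes map the same region of the same file.
                MappedByteBuffer buf =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
                if (args.length > 0 && args[0].equals("write")) {
                    buf.putInt(0, 42);                              // writer process
                } else {
                    System.out.println("value = " + buf.getInt(0)); // reader process
                }
            }
        }
    }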
There is a fairly new initiative from Apache for language-agnostic IPC of columnar (i.e., array-based) data, called Plasma.
As yet (Sept '17) there are no JVM bindings, but since the project is backed by the likes of Spark, I think it won't be long before we see an implementation.
My understanding, however, is that it is not a general IPC system: it is geared toward sharing arrays of primitives, like double and long, for scientific computing rather than classes/objects; though I could be wrong here.
On the plus side, it is also language-agnostic, so you could use it to communicate with another (non-JVM) runtime. However, the OP did ask for Java IPC, so this may be irrelevant.