Protocol Buffers and Hadoop - serialization

I am new to the Hadoop world. I know that Hadoop has its own serialization mechanism called Writables, and that Avro is another such library. I wanted to know whether we can write MapReduce jobs using Google's Protocol Buffers serialization. If yes, can someone point me to a good example to get started?

Twitter has published its elephant-bird library, which allows Hadoop to work with Protocol Buffer files.
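For a flavor of what that looks like, here is a minimal sketch (untested) of a mapper that wraps a protoc-generated message in elephant-bird's ProtobufWritable so Hadoop can ship it between tasks; the Person message class and the tab-separated input are made-up examples:

    import java.io.IOException;

    import com.twitter.elephantbird.mapreduce.io.ProtobufWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PersonMapper
            extends Mapper<LongWritable, Text, Text, ProtobufWritable<Person>> {

        // Reusable wrapper; newInstance ties the writable to the message class.
        private final ProtobufWritable<Person> value =
                ProtobufWritable.newInstance(Person.class);

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes each input line is "name<TAB>email"; Person is a
            // hypothetical protoc-generated message class.
            String[] fields = line.toString().split("\t");
            value.set(Person.newBuilder()
                    .setName(fields[0])
                    .setEmail(fields[1])
                    .build());
            context.write(new Text(fields[0]), value);
        }
    }

elephant-bird also ships input and output formats for reading and writing protobuf-encoded files directly, which is usually where you would start for an end-to-end job.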

Related

Using Apache hudi library in java clients

I am a Hudi newbie. I was wondering if the Hudi client libraries can be used straight from Java clients to write to Amazon S3 folders. I am trying to build a system that can store a large number of events (up to 50k/second) emitted from a distributed system of over 10 components. I was wondering if I could build a simple client using a Hudi client library that buffers this data and then periodically writes it into a Hudi datastore?
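Roughly, the buffer-and-flush pattern I have in mind looks like the sketch below; flushToHudi is just a placeholder for whatever Hudi write-client call (e.g. an upsert) turns out to be appropriate for the Hudi version in use:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class BufferedEventWriter<T> {
        private final List<T> buffer = new ArrayList<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public BufferedEventWriter(long flushIntervalSeconds) {
            // Flush the buffer on a fixed schedule rather than per event.
            scheduler.scheduleAtFixedRate(this::flush,
                    flushIntervalSeconds, flushIntervalSeconds, TimeUnit.SECONDS);
        }

        public synchronized void add(T event) {
            buffer.add(event);
        }

        private synchronized void flush() {
            if (buffer.isEmpty()) return;
            flushToHudi(new ArrayList<>(buffer));
            buffer.clear();
        }

        private void flushToHudi(List<T> batch) {
            // Placeholder: replace with a real Hudi write-client call
            // (the API differs across Hudi versions).
        }
    }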

Can I use a different Protocol Buffers version for Pig jobs than the one the underlying Hadoop infrastructure uses?

A Hadoop cluster I work on ships with a Protocol Buffers jar for its own use. I would like to write Pig scripts that store data using their own version of Protocol Buffers by means of a UDF. Is it possible for my UDFs to use a different version of Protocol Buffers than the underlying Hadoop system?
For context, compiled Protocol Buffers code should not be mixed across versions, even though the wire format is compatible between them. So I would need assurances that by supplying my own jar I am not replacing the standard version Hadoop uses for its internal work. I have also confirmed that the standard version of Protocol Buffers supplied on my cluster is incompatible with the compiled message classes I supplied.
At issue here is whether the Protocol Buffers library versions must be kept 100% in sync at all times between the cluster and my library code. That would introduce tight coupling and future maintenance headaches. I am likely to switch over to Thrift, which apparently has no pre-existing dependencies on the cluster.
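One possible mitigation, sketched below under the assumption of a Hadoop 2.x cluster, is to ask the framework to prefer the job's own jars on the task classpath via the standard mapreduce.job.user.classpath.first property:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SubmitWithOwnProtobuf {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Prefer classes from the job's jars over the cluster's copies,
            // so the UDF's Protocol Buffers version wins inside the task JVM.
            conf.setBoolean("mapreduce.job.user.classpath.first", true);
            Job job = Job.getInstance(conf, "pig-udf-with-own-protobuf");
            // ... configure and submit the job as usual ...
        }
    }

In a Pig script the same property should be settable with "set mapreduce.job.user.classpath.first true;". Shading (relocating) the Protocol Buffers package inside the UDF jar would be the more robust fix, since it sidesteps the classpath ordering question entirely.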

HiveServer versus HiveServer2

I know that HiveServer does not support multi-client concurrency or authentication, and that both are handled in HiveServer2.
I want to know how this is handled in HiveServer2 and why it is not supported in HiveServer.
Thanks,
Sree
The answer to this question is simple; I got to know it a few days ago.
Every client connects through the Thrift API, in both HiveServer and HiveServer2, which in turn launches a process to convert the client code into code Hive understands by loading language-specific class libraries.
As everyone is aware, a process can be single- or multi-threaded. In HiveServer1 the launched process is single-threaded, because the class libraries do not support multiple threads. In HiveServer2 these have been upgraded to multi-threaded class libraries, so it supports multiple sessions.
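To see the multi-session behavior in practice, each client simply opens its own connection to HiveServer2 through the Thrift-based JDBC driver. A minimal sketch (host, port, and credentials are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveServer2Example {
        public static void main(String[] args) throws Exception {
            // HiveServer2's JDBC driver; HiveServer1 used a different one.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }

Many such clients can run this concurrently against HiveServer2, each getting its own session.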
Regarding security, please refer to the link below:
http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/
Thanks,
Sree

Amazon S3/OpenStack Swift API skeleton

I would like to implement a cloud storage service with the same interface as OpenStack Swift or Amazon S3. In other words, my cloud storage service should expose the same API as the above-mentioned services, but with a custom implementation. This way, a client will be able to interoperate with my service without changing its implementation.
I was wondering if there is an easier approach than manually implementing such interfaces starting from the documentation:
http://docs.openstack.org/api/openstack-object-storage/1.0/content/
http://docs.aws.amazon.com/AmazonS3/latest/API/APIRest.html
For instance, it would be nice if there was a "skeleton" of OpenStack Swift or Amazon S3 APIs from which I can start implementing my service.
Thanks
I found exactly what I was looking for:
https://github.com/jubos/fake-s3
https://github.com/scireum/s3ninja
These tools emulate most of the Amazon S3 API. They are meant for development and testing purposes, but in my case I can use them as a starting point for implementing my cloud storage service.
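To point a standard client at one of these emulators, you only have to override the endpoint. A sketch with the AWS SDK for Java v1; the port and the dummy credentials are assumptions that depend on how the emulator was started:

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class LocalS3Client {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                    .withEndpointConfiguration(
                            new AwsClientBuilder.EndpointConfiguration(
                                    "http://localhost:4567", "us-east-1"))
                    .withCredentials(new AWSStaticCredentialsProvider(
                            new BasicAWSCredentials("dummy", "dummy")))
                    // Emulators typically serve path-style URLs rather than
                    // virtual-hosted buckets.
                    .withPathStyleAccessEnabled(true)
                    .build();

            s3.createBucket("test-bucket");
            s3.putObject("test-bucket", "hello.txt", "hello world");
        }
    }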
Someone has done this for you; try jclouds, which supports AWS S3 and Swift: Apache jclouds®.
I would recommend using Swift (the OpenStack object store), which also supports the S3 API.
Take a look at the following link:
http://docs.openstack.org/grizzly/openstack-object-storage/admin/content/configuring-openstack-object-storage-with-s3_api.html
This way you can work with OpenStack Swift or Amazon S3.
Another option is libcloud, a Python abstraction layer that supports a number of providers (including S3 and Swift):
https://libcloud.readthedocs.org/en/latest/storage/index.html
http://libcloud.apache.org/
If you are looking for an enterprise / carrier grade object storage software solution, look at Cloudian http://www.cloudian.com.
Cloudian's software delivers a fully Amazon S3 compliant API, meaning that it delivers the broadest range of S3 feature coverage and 100% fidelity with the AWS S3 API.
The software comes with a free 10TB license, so it is essentially free up to 10TB of managed storage; beyond that it is reasonably priced. You can install the software on any x86 hardware running Linux.
Cloudian does not support the Swift API though.
[Disclaimer: I work for Cloudian]

Is it possible to have inter-process communication in Java?

I have two Java programs, each running in its own JVM instance. Can they communicate with each other using an IPC technique like shared memory or pipes? Is there a way to do it?
Yes; D-BUS and Pipes are both easy to use, and cross-platform. D-BUS is useful for general message-passing IPC, and pipes for sending bulk data.
You can also open a TCP or UDP socket on localhost, if you need to support multiple clients connecting to a central server.
I also found an implementation of Unix domain sockets in Java, though it requires JNI.
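A minimal sketch of the localhost-socket option; run the two classes in separate JVMs (the port is arbitrary):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class Server {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(9999);
                 Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                System.out.println("got: " + in.readLine());
            }
        }
    }

    class Client {
        public static void main(String[] args) throws Exception {
            try (Socket socket = new Socket("localhost", 9999);
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                out.println("hello from another JVM");
            }
        }
    }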
http://java.sun.com/javase/technologies/core/basic/rmi/index.jsp
Java Remote Method Invocation (Java RMI) enables the programmer to create distributed Java-to-Java applications, in which the methods of remote Java objects can be invoked from other Java virtual machines, possibly on different hosts. RMI uses object serialization to marshal and unmarshal parameters and does not truncate types, supporting true object-oriented polymorphism.
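A minimal RMI sketch, with the server and client run in separate JVMs (the names and the default registry port 1099 are illustrative):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    interface Greeter extends Remote {
        String greet(String name) throws RemoteException;
    }

    public class RmiServer implements Greeter {
        public String greet(String name) { return "hello, " + name; }

        public static void main(String[] args) throws Exception {
            // Export the object and register it under a well-known name.
            Greeter stub = (Greeter) UnicastRemoteObject.exportObject(new RmiServer(), 0);
            Registry registry = LocateRegistry.createRegistry(1099);
            registry.rebind("greeter", stub);
        }
    }

    class RmiClient {
        public static void main(String[] args) throws Exception {
            Registry registry = LocateRegistry.getRegistry("localhost", 1099);
            Greeter greeter = (Greeter) registry.lookup("greeter");
            System.out.println(greeter.greet("other JVM"));
        }
    }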
Sure. Have a look at RMI or a shared-memory concept like JavaSpaces.
Use a MappedByteBuffer from Java NIO for sharing memory between processes.
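A sketch of that approach: two JVMs map the same file and see each other's writes (the file path, size, and offsets are arbitrary):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class SharedMemoryWriter {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("/tmp/shared.dat", "rw");
                 FileChannel channel = file.getChannel()) {
                MappedByteBuffer buf =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
                buf.putInt(0, 42); // visible to any other process mapping this file
            }
        }
    }

    class SharedMemoryReader {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("/tmp/shared.dat", "rw");
                 FileChannel channel = file.getChannel()) {
                MappedByteBuffer buf =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
                System.out.println(buf.getInt(0)); // prints 42 after the writer runs
            }
        }
    }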
There is a fairly new initiative from Apache for language-agnostic IPC of columnar (i.e. array-based) data, called Plasma (part of the Apache Arrow project).
As yet (Sept '17) there are no JVM bindings, but since the project is backed by the likes of Spark, I think it won't be long before we see an implementation.
My understanding, however, is that it is not a general-purpose IPC system: it is geared toward sharing arrays of primitives like double and long for scientific computing rather than classes/objects, though I could be wrong here.
On the plus side, it is also language-agnostic, so you could use it to communicate with another (non-JVM) runtime. However, the OP did ask for Java IPC, so this could be irrelevant.