GemFire GROUP BY functions

Is there an option to use GROUP BY and aggregate functions in GemFire OQL? I couldn't find any relevant documentation; the only relevant Google result points to an old version of GemFire, so I'm not sure whether something has changed regarding this.
We're looking at avoiding iteration and letting GemFire do the heavy lifting.
http://forum.spring.io/forum/spring-projects/data/gemfire/108266-no-group-by-having-order-by-key-words-in-gemfire-oql

That is correct; Pivotal GemFire does not support GROUP BY or aggregate functions yet.
However, you may be aware that Pivotal GemFire was submitted to the ASF over a year ago as the Apache Geode project. Recently, the Geode engineering team added support for GROUP BY along with the associated aggregate functions, such as COUNT, MIN, MAX, AVG, and SUM.
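For illustration, here is the kind of query this enables. This is only a hedged sketch: the /orders Region and its customerId/amount fields are made up, and the package names differ between Apache Geode (org.apache.geode.*) and older Pivotal GemFire releases (com.gemstone.gemfire.*).

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.query.Query;
    import org.apache.geode.cache.query.QueryService;
    import org.apache.geode.cache.query.SelectResults;

    public class GroupByExample {
        // Sketch only: the /orders Region with customerId and amount fields is hypothetical.
        public static SelectResults<?> totalsPerCustomer(Cache cache) throws Exception {
            QueryService queryService = cache.getQueryService();
            Query query = queryService.newQuery(
                "SELECT o.customerId, SUM(o.amount) FROM /orders o GROUP BY o.customerId");
            return (SelectResults<?>) query.execute();
        }
    }

As in SQL, any field in the projection that is not aggregated has to appear in the GROUP BY clause.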
However, Apache Geode and Pivotal GemFire have diverged significantly, so Apache Geode contains quite a few new features that Pivotal GemFire does not (e.g. off-heap memory support).
But the plan is that Pivotal GemFire and Apache Geode will converge, and that Pivotal GemFire 9.0 will be based on the Apache Geode core, thus inheriting all the new features, like off-heap memory and GROUP BY.
I have no timeframe for when that will happen, but it is the plan.
Cheers,
John

Related

How to monitor JVM memory?

Is there any tool available other than Java VisualVM? I have also tried JConsole, but it only gives data for one day.
Java Mission Control is pretty good and available for free in Java 7 and above, I believe. For a commercial option we're using DynaTrace, and it provides other detailed information in addition to JVM info.
New Relic is another tool for constant application monitoring with a lot of out-of-the-box options. Among other things, it lets you see historical records (for hours, days, weeks, or months) including memory, CPU, etc.
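If you mainly need to keep a history of heap usage yourself (longer than JConsole keeps it), you can also poll the JVM's own MemoryMXBean from a small scheduled job and write the numbers to a log or a time-series store. A minimal sketch of the local case:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapProbe {
        public static void main(String[] args) {
            // Poll heap usage via JMX; run this periodically (or expose it remotely
            // with the usual com.sun.management.jmxremote.* flags) to build a history.
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("heap used=%d MB committed=%d MB max=%d MB%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
        }
    }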

What are the main differences between MapReduce and Apache Hama?

Hi, I am finding it difficult to compare MapReduce with Hama. I understand that Hama uses the bulk synchronous parallel (BSP) model, and that the worker nodes can communicate with one another, whereas in Apache Hadoop the worker nodes only communicate with the NameNode; is that correct? If so, I don't understand the benefits Hama would have over standard MapReduce in Hadoop. Thanks!
Can you go through this PDF link?
It explains the difference between MapReduce and BSP (Apache Hama offers a Bulk Synchronous Parallel computing engine).
The MapReduce framework has been used to solve a number of non-trivial problems in academia, and putting MapReduce on strong theoretical foundations is crucial to understanding its capabilities, while Hama uses the BSP model of computation, underlining the relevance of BSP to modern parallel algorithm design and defining a subclass of BSP algorithms that can be efficiently implemented in MapReduce.
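The practical difference is easiest to see in the programming model: a BSP peer keeps its state in memory, sends messages directly to other peers, and then waits at a global barrier before the next superstep, whereas expressing the same thing in MapReduce means chaining jobs and rewriting intermediate state to HDFS. Below is a conceptual sketch of one superstep in plain Java; it does not use the actual Hama BSP API, it only mimics the compute/send/barrier/receive pattern that Hama exposes.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.CyclicBarrier;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class BspSketch {
        static final int PEERS = 3;
        static final List<BlockingQueue<Integer>> inboxes = new CopyOnWriteArrayList<>();
        static final CyclicBarrier barrier = new CyclicBarrier(PEERS);

        public static void main(String[] args) throws Exception {
            for (int i = 0; i < PEERS; i++) inboxes.add(new LinkedBlockingQueue<>());
            ExecutorService pool = Executors.newFixedThreadPool(PEERS);
            for (int id = 0; id < PEERS; id++) {
                final int me = id;
                pool.submit(() -> {
                    int local = (me + 1) * 10;                   // superstep 1: local computation
                    inboxes.get((me + 1) % PEERS).add(local);    // message a neighbouring peer directly
                    barrier.await();                             // global synchronisation barrier
                    int received = inboxes.get(me).take();       // superstep 2: consume received messages
                    System.out.println("peer " + me + " received " + received);
                    return null;
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }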

Zookeeper vs Redis for server sync

I have a small cluster of servers I need to keep in sync. My initial thought on this was to have one server be the "master", publish updates using Redis's pub/sub functionality (since we are already using Redis for storage), and let the other servers in the cluster, the slaves, poll for updates in a long-running task. This seemed to be a simple method to keep everything in sync, but then I thought of the obvious issue: what if my "master" goes down? That is where I started looking into techniques to make sure there is always a master, which led me to reading about ideas like leader election. Finally, I stumbled upon Apache Zookeeper (through the Python binding "pettingzoo"), which apparently takes care of a lot of the fault-tolerance logic for you. I may be able to write my own leader election code, but I figure it wouldn't be nearly as good as something that has been proven and tested, like Zookeeper.
My main issue with using Zookeeper is that it is just another component that I may be adding to my setup unnecessarily when I could get by with something simpler. Has anyone ever used Redis in this way? Or is there any other simple method I can use to get the type of functionality I am trying to achieve?
More info about pettingzoo (slideshare)
I'm afraid there is no simple method to achieve high availability. It is usually tricky to set up and tricky to test. There are multiple ways to achieve HA, which fall into two categories: physical clustering and logical clustering.
Physical clustering is about using hardware, network, and OS level mechanisms to achieve HA. On Linux, you can have a look at Pacemaker which is a full-fledged open-source solution coming with all enterprise distributions. If you want to directly embed clustering capabilities in your application (in C), you may want to check the Corosync cluster engine (also used by Pacemaker). If you plan to use commercial software, Veritas Cluster Server is a well established (but expensive) cross-platform HA solution.
Logical clustering is about using fancy distributed algorithms (like leader election, PAXOS, etc ...) to achieve HA without relying on specific low level mechanisms. This is what things like Zookeeper provide.
Zookeeper is a consistent, ordered, hierarchical store built on top of the ZAB protocol (quite similar to PAXOS). It is quite robust and can be used to implement some HA facilities, but it is not trivial, and you need to install the JVM on all nodes. For good examples, you may have a look at some recipes and the excellent Curator library from Netflix. These days, Zookeeper is used well beyond the pure Hadoop contexts, and IMO, this is the best solution to build a HA logical infrastructure.
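To give an idea of how small the "what if my master goes down" problem becomes on top of Zookeeper, here is a hedged sketch using Curator's LeaderLatch recipe. The connection string and latch path are made up, and the package names are from the Apache Curator releases (the older Netflix releases used com.netflix.curator.*).

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class MasterElection {
        public static void main(String[] args) throws Exception {
            // Hypothetical Zookeeper ensemble and latch path.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderLatch latch = new LeaderLatch(client, "/myapp/master");
            latch.start();
            latch.await();   // blocks until this node is elected master

            // This node is now the master: publish updates to the slaves here.
            // If it dies or loses its session, one of the remaining nodes is elected instead.

            latch.close();
            client.close();
        }
    }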
Redis pub/sub mechanism is not reliable enough to implement a logical cluster, because unread messages will be lost (there is no queuing of items with pub/sub). To achieve HA of a collection of Redis instances, you can try Redis Sentinel, but it does not extend to your own software.
If you are ready to program in C, an HA framework which is often forgotten (but can be quite useful IMO) is the one that comes with BerkeleyDB. It is quite basic, but it supports off-the-shelf leader elections and can be integrated into any environment. Documentation can be found here and here. Note: you do not have to store your data with BerkeleyDB to benefit from the HA mechanism (only the topology data - the same data you would put in Zookeeper).

Memory requirements when hosting R in the cloud

What is the smallest server we would need to run OpenCPU if we expect 100,000 hits a month?
I think OpenCPU is an exciting project, but I need to know about memory usage when OpenCPU is deployed, since a cloud hosting service such as Rackspace charges about $40 per month for 1 GB of RAM.
I know that if I load R without doing anything, and without loading any data or packages into RAM, it uses almost 700 MB of virtual memory and 50 MB of resident memory.
I know that OpenCPU uses rApache, and rApache uses preforking, but I want to know how this will scale as the number of concurrent users increases. Thanks.
Thanks for the responses.
I talked with Jeroen Ooms when visiting LA, and am partly convinced that OpenCPU will work in high-concurrency environments if used correctly, and that he is available to fix issues if they arise. OpenCPU is related to his dissertation, after all! In particular, what I find useful about OpenCPU is its integration with Ubuntu's AppArmor, which can restrict processes from using too much RAM and CPU. I think Apache might also be able to do this, but RAppArmor can do this and much more. Brilliant! If AppArmor were the only advantage, I would just use that and JSON as a backend, but it seems like OpenCPU can also streamline the installation of all this stuff and provides a built-in API system.
Given the cost of web-hosting, I imagine a workable real-time analytics system is the following:
create R statistical models on demand, on a specialized analytical server, as often as needed (e.g. every day or hour using cron)
transfer the results of the models to a directory on the OpenCPU servers using FTP, as native R objects
on the OpenCPU server, go to that directory and grab the R objects representing the statistical models, and then make predictions or run simulations using them. For example, use the 'predict' function to provide estimates based on user-supplied variables (see the sketch after this list).
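As a rough illustration of the last step, a client could hit the OpenCPU HTTP API to run the prediction. Everything named here is hypothetical: the server URL, an R package "mymodels" installed on the OpenCPU server, and a score() function in it that loads the transferred model object and calls predict() on the user-supplied variables.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    public class OpenCpuCall {
        public static void main(String[] args) throws IOException {
            // Hypothetical endpoint: /ocpu/library/{package}/R/{function}/json returns the
            // function's return value as JSON in a single round trip.
            URL url = new URL("http://opencpu.example.com/ocpu/library/mymodels/R/score/json");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            byte[] body = "income=52000&age=37".getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
                System.out.println(in.useDelimiter("\\A").next());  // JSON-encoded prediction
            }
        }
    }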
Does anybody else see this as a viable way to make R a backend for real time analytics?
Dirk is right, it all depends on the R functions that you are calling; the overhead of the OpenCPU architecture should be quite minimal. OpenCPU itself will run on a small server, but as we all know some functionality in R requires much more memory/cpu than others.
If you really want to know how many resources are needed just to run OpenCPU, you could do some benchmarking. As you noted, prefork is used to branch sessions off the main process, so in most cases the copy-on-write principle of forking should make it pretty cheap.
Also, there is some other stuff that you can tweak, e.g. preloading of frequently used packages.

Distributed caching on Mono

I'm searching for a distributed caching solution on Mono similar to Java's Terracotta and Infinispan. I want to use it as a second-level cache for NHibernate. Velocity and SharedCache have no Mono support, and memcached isn't distributed and doesn't provide high availability.
Best Regards,
sirmak
You are looking for a more sophisticated data grid solution that will provide scaling and high availability; I find memcached a bit too primitive for such requirements. I would advise looking into GigaSpaces XAP or VMware GemFire. Both are Java products with .NET clients, and both are very strong. GigaSpaces may offer a bit more in the way of co-location capabilities.
I think you meant "replicated" instead of "distributed". Memcached is indeed distributed, but not replicated. However, you can make it replicated with this patch.