How do I run a data-dependent function on a partitioned region in a member group? - gemfire

My team uses Geode as a makeshift analytics engine. We store a collection of massive raw data objects (200MB+ each) in Geode, but these objects are never directly returned to the client. Instead, we rely heavily on custom function execution to process these data sets inside Geode, and only return the analysis result set.
We have a new requirement to implement two tiers of data analytics precision. The high-precision analytics will require larger raw data sets and more CPU time. It is imperative that these high-precision analyses do not inhibit the low-precision analytics performance in any way. As such, I'm looking for a solution that keeps these data sets isolated to different servers.
I built a POC that keeps each data set in its own region (both are PARTITIONED). These regions are configured to belong to separate Member Groups, then each server is configured to join one of the two groups. I'm able to stand up this cluster locally without issue, and gfsh indicates that everything looks correct: describe member shows each member hosting the expected regions.
My client code configures a ClientCache that points at the cluster's single locator. My function execution command generally looks like the following:
FunctionService
.onRegion(highPrecisionRegion)
.setArguments(inputObject)
.withFilter(keySet)
.execute(function);
When I only run the high-precision server, I'm able to execute the function against the high-precision region. When I only run the low-precision server, I'm able to execute the function against the low-precision region. However, when I run both servers and execute the functions one after the other, I invariably get an exception stating that one of the regions cannot be found. See the following Gist for a sample of my code and the exception.
https://gist.github.com/dLoewy/c9f695d67f77ec18a7e60a25c4e62b01
TLDR key points:
Using member groups, Region A is on Server 1 and Region B is on Server 2.
These regions must be PARTITIONED in Production.
I need to run a data-dependent function on one of these regions; the client code chooses which.
As-is, my client code always fails to find one of the regions.
Can someone please help me get on track? Is there an entirely different cluster architecture I should be considering? Happy to provide more detail upon request.
Thanks so much for your time!
David
FYI, the following docs pages mention function execution on Member Groups, but give very little detail. The first link describes running data-independent functions on member groups, but doesn't say how, and doesn't say anything about running data-dependent functions on member groups.
https://gemfire.docs.pivotal.io/99/geode/developing/function_exec/how_function_execution_works.html
https://gemfire.docs.pivotal.io/99/geode/developing/function_exec/function_execution.html

Have you tried creating two different pools on the client, each one targeting a specific server group, and executing the function as usual with onRegion? I believe that should do the trick. For further details, please have a look at Organizing Servers Into Logical Member Groups.
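A rough sketch of that approach, reusing inputObject, keySet, and function from the snippet in the question, and assuming a single locator on localhost:10334 plus server groups named high-precision and low-precision (all hosts, ports, pool, group, and region names are placeholders for your own configuration):
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;
import org.apache.geode.cache.client.Pool;
import org.apache.geode.cache.client.PoolManager;
import org.apache.geode.cache.execute.FunctionService;

ClientCache cache = new ClientCacheFactory().create();

// One pool per server group, both pointing at the same locator.
Pool highPool = PoolManager.createFactory()
    .addLocator("localhost", 10334)
    .setServerGroup("high-precision")
    .create("highPrecisionPool");

Pool lowPool = PoolManager.createFactory()
    .addLocator("localhost", 10334)
    .setServerGroup("low-precision")
    .create("lowPrecisionPool");

// Each client region proxy is bound to the pool (and therefore the group) that hosts it.
Region<Object, Object> highPrecisionRegion = cache
    .createClientRegionFactory(ClientRegionShortcut.PROXY)
    .setPoolName("highPrecisionPool")
    .create("HighPrecisionRegion");

// onRegion then routes the execution through the region's own pool only.
FunctionService.onRegion(highPrecisionRegion)
    .setArguments(inputObject)
    .withFilter(keySet)
    .execute(function);
Because each client region proxy is tied to a group-scoped pool, the onRegion execution should only ever reach servers in that group.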
Hope this helps. Cheers.

As the region data is not replicated across servers, it looks like you need to target the onMembers or onServers methods as well as onRegion.
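If you do explore that route, note that onServers executions are data-independent (there is no key filter); a sketch reusing a group-scoped pool like the one above, with the argument and function from the question (names are placeholders):
Pool highPool = PoolManager.createFactory()
    .addLocator("localhost", 10334)
    .setServerGroup("high-precision")
    .create("highPrecisionPool");

// Runs the function once on every server reachable through this pool.
FunctionService.onServers(highPool)
    .setArguments(inputObject)
    .execute(function);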

Related

Custom Dataflow Template - BigQuery to CloudStorage - documentation? general solution advice?

I am consuming a BigQuery table datasource. It is 'unbounded' as it is updated via a batch process. It contains session keyed reporting data from server logs where each row captures a request. I do not have access to the original log data and must consume the BigQuery table.
I would like to develop a custom Java-based Google Dataflow template using the Beam API with the goals of:
collating keyed session objects
deriving session level metrics
deriving filterable window level metrics based on session metrics, e.g., percentage of sessions with errors during previous window and percentage of errors per filtered property, e.g., error percentage per device type
writing the result as a formatted/compressed report to cloud storage.
This seems like a fairly standard use case? In my research thus far, I have not yet found a perfect example and still have not been able to determine the best practice approach for certain basic requirements. I would very much appreciate any pointers. Keywords to research? Documentation, tutorials. Is my current thinking right or do I need to consider other approaches?
Questions :
beam windowing and BigQuery I/O Connector - I see that I can specify a window type and size via beam api. My BQ table has a timestamp field per row. Am I supposed to somehow pass this via configuration or is it supposed to be automagic? Do I need to do this manually via a SQL query somehow? This is not clear to me.
fixed time windowing vs. session windowing functions - examples are basic and do not address any edge cases. My sessions can last hours. There are potentially 100k-plus session keys per window. Would session windowing support this?
BigQuery vs. BigQueryClientStorage - The difference is not clear to me. I understand that BQCS provides a performance benefit, but do I have to store BQ data in a preliminary step to use this? Or can I simply query my table directly via BQCS and it takes care of that for me?
For number 1 you can simply use a WithTimestamps transform before applying windowing; this assigns the timestamp to your items. There are Python examples of this in the Beam documentation.
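Since the question targets the Java SDK, here is a minimal sketch of that step under some assumptions: the table name is a placeholder, and each row is assumed to carry an epoch-millis value in a field called event_timestamp_millis (adapt the parsing to your schema):
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

Pipeline pipeline = Pipeline.create();

PCollection<TableRow> rows = pipeline.apply(
    BigQueryIO.readTableRows().from("my-project:my_dataset.requests"));

PCollection<TableRow> windowed = rows
    // Assign each element the event time carried in its own row.
    .apply(WithTimestamps.of((TableRow row) ->
        new Instant(Long.parseLong((String) row.get("event_timestamp_millis")))))
    // Then window on that event time (10-minute fixed windows as an example).
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardMinutes(10))));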
For number 2 the documentation states:
Session windowing applies on a per-key basis and is useful for data that is irregularly distributed with respect to time. [...] If data arrives after the minimum specified gap duration time, this initiates the start of a new window.
Also, in the Java documentation you can only specify a minimum gap duration, not a maximum. This means that session windowing can easily support sessions lasting hours. After all, the only thing it does is put a watermark on your data and keep it alive.
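As a sketch, assuming the collection has already been keyed by session key and timestamped as above (keyedRows stands for that keyed PCollection, and the 30-minute gap is just a placeholder):
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.values.KV;

PCollection<KV<String, TableRow>> sessioned = keyedRows.apply(
    // Elements that share a key and arrive within 30 minutes of each other fall into one session window.
    Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30))));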
For number 3, the difference between the BigQuery I/O connector and the BigQuery Storage API is that the latter (an experimental feature as of 01/2020) reads the stored table data directly, without going through BigQuery's query execution layer. This means that, with the Storage API, as the documentation states:
you can't use it to read data sources such as federated tables and logical views
Also, there are different limits and quotas between the two methods, which you can find in the documentation.
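If you read the table through Beam's Java connector, switching to the Storage Read API is (reusing the pipeline from the earlier sketch; the table name is a placeholder) just a matter of setting the read method:
PCollection<TableRow> directRows = pipeline.apply(
    BigQueryIO.readTableRows()
        .from("my-project:my_dataset.requests")
        // DIRECT_READ uses the BigQuery Storage Read API instead of a query/export job.
        .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ));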

Can I "pin" a Geode/Gemfire region to a specific subset of servers?

We make heavy use of Geode function execution to operate on data that lives inside Geode. I want to configure my cluster so functions we execute against a certain subset of data are always serviced by a specific set of servers.
In my mind, my ideal configuration looks like this: (partitioned) Region A is only ever serviced by servers 1 and 2, while (partitioned) Region B is only ever serviced by servers 3, 4, and 5.
The functions that we execute against the two regions have very different CPU/network requirements; we want to isolate the performance impacts of one region from the other, and ideally be able to tune the hardware for each server accordingly.
Assuming, operationally, that you're using gfsh to manage your cluster, you could use groups to logically segregate your cluster by assigning each server to a relevant group. Creating regions then simply requires you to also indicate which group a region should be created in. Functions should already be constrained to execute against a given region with FunctionService.onRegion() calls.
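In gfsh terms that means starting each server with the appropriate --group and creating each region with a matching --group. If you instead bootstrap servers programmatically, a rough Java equivalent (group, locator, and region names below are placeholders) is the groups member property:
import java.util.Properties;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.RegionShortcut;

Properties props = new Properties();
props.setProperty("groups", "groupA");               // this member joins member group "groupA"
props.setProperty("locators", "localhost[10334]");   // placeholder locator

Cache cache = new CacheFactory(props).create();

// Only members started with this configuration (i.e. the group's servers) create and host the region.
cache.createRegionFactory(RegionShortcut.PARTITION).create("RegionA");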
Note: If you're perusing the FunctionService API, don't be tempted to use the onMember(group) methods as those unfortunately only work for peer-to-peer (server-to-server) calls. I'm assuming here that you're doing typical client-server calls. Of course, if you are doing p2p function calls then those methods would be totally relevant.
You can split your servers into different groups and then create the regions in those specific groups, allowing you to correctly "route" the function executions. You can get more details about this feature in Organizing Peers into Logical Member Groups.
Hope this helps. Cheers.

configure parallel async event queue on replicated region in Gemfire

I'm trying to configure Gemfire/Geode in order to have an async event queue with parallel=true on a replicated region. However, I'm getting the following exception at startup:
com.gemstone.gemfire.internal.cache.wan.AsyncEventQueueConfigurationException: Parallel Async Event Queue myQueue can not be used with replicated region /myRegion
This (i.e. preventing parallel queues on replicated regions) seems to be a design decision, but I can't understand why that is the case.
I have read all the documentation I've been able to find (primarily http://gemfire.docs.pivotal.io/docs-gemfire/latest/reference/book_intro.html and related docs), and searched for any kind of reference to this exception on the internet, but I didn't find any clear explanation of why I can't have an event listener on each member hosting a replicated region.
My conclusion is that I must be missing some fundamental concept about replicated regions and/or parallel queues, but since I can't find the appropriate documentation on my own, I'm asking for an explanation and/or pointers to the right resources to read.
Thanks in advance.
EDIT : Let me put the question into context.
I have an external system sending data to my application using REST services, which are load balanced between nodes in order to maximize performance. Each of the nodes hosts the same regions (let's say 3, named A B and C). The data travels through all those regions (A to B to C) and is processed along the way. This means that region A hosts data that has just been received, region B data that has been partially processed and region C hosts data whose processing is complete.
I am using event listeners to process data and move it from region to region, and in case of the listener for region C, to export it to another external system.
All the listeners must (and I repeat, must) be transactional.
I also need horizontal scalability (i.e. adding nodes on the fly to increase throughput) and the maximum amount of data replication that can possibly be achieved.
Moreover, I want to run all of the nodes with the same gemfire configuration.
I have already tried to use partitioned regions, but they do not fit my needs for a bunch of reasons that I won't explain here for the sake of brevity (just trust me, it is not currently possible).
So I thought that having all the nodes host the replicated regions could be the way, but I need all of them to be able to process events independently and perform region synchronization afterwards in an active/active scenario. It is my understanding that this requires event queues to be parallel, but it does not seem possible (by design).
So the (updated) question(s) are :
Is this scenario even possible? And if it is, how can I achieve it?
Any explanation and/or documentation, example, resource or anything else is more than welcome.
Again, thanks in advance.
An AsyncEventQueue is used to write data that arrives in GemFire to some other data store. You would ideally want to do this only once. Since the content of a replicated region is the same on all the members of the system, you only need an async event listener on one member, hence parallel=true is not supported.
For partitioned regions, if you only had one member that hosts the AsyncEventQueue, then every single put to the partitioned region would also be routed through that member. This introduces a single point of contention in the system. The solution to this problem was the introduction of parallel AsyncEventQueues, so that events on each member are only queued up locally in that member.
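For reference, a minimal sketch of a parallel queue attached to a partitioned region (queue id, listener class, and region name are placeholders; the imports use Apache Geode package names, while older GemFire releases use the com.gemstone.gemfire equivalents):
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.RegionShortcut;
import org.apache.geode.cache.asyncqueue.AsyncEventQueue;

// With parallel=true each member queues and delivers only the events for its own primary buckets.
AsyncEventQueue queue = cache.createAsyncEventQueueFactory()
    .setParallel(true)
    .create("myQueue", new MyAsyncEventListener());

cache.createRegionFactory(RegionShortcut.PARTITION)
    .addAsyncEventQueueId("myQueue")
    .create("myRegion");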
GemFire also supports CacheListeners, which are invoked on each member even for replicated regions; however, they are synchronous. You can introduce a thread pool in your CacheListener to get the same functionality.
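A minimal sketch of that workaround (pool size and processing logic are placeholders); note that work handed off this way runs outside the original cache transaction, which matters given the transactional requirement above:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.geode.cache.EntryEvent;
import org.apache.geode.cache.util.CacheListenerAdapter;

public class AsyncProcessingListener extends CacheListenerAdapter<String, Object> {

  // Placeholder pool size; size it for your workload.
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  @Override
  public void afterCreate(EntryEvent<String, Object> event) {
    String key = event.getKey();
    Object value = event.getNewValue();
    // Hand the work off so the cache thread is not blocked.
    // Caveat: anything submitted here runs outside the original cache transaction.
    pool.submit(() -> process(key, value));
  }

  private void process(String key, Object value) {
    // e.g. transform the entry and move it to the next region in the pipeline
  }
}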

BigQuery UDF works in one project but not another

I have been using UDFs for a few months now with a lot of success. Recently, I set up separate projects for development, and stream a sample of 1/10 of our web tracking data into these projects.
What I'm finding is that the UDFs I use in production, which operate on the full dataset, are working, while the exact same query in our development project consistently fails, despite querying 1/10 of the data. The error message is:
Query Failed
Error: Resources exceeded during query execution: UDF out of memory.
Error Location: User-defined function
I've looked through our Quotas and haven't found anything that would be limiting the development project.
Does anybody have any ideas?
If anybody can look into it, here are the job IDs:
Successful query in production: bquijob_4af38ac9_155dc1160d9
Failed query in development: bquijob_536a2d2e_155dc153ed6
Jan-Karl, apologies for the late response; I've been out of the country to speak at some events in Japan and then have been dealing with oncall issues with production.
I finally got a chance to look into this for you. The two job ids you sent me are running very different queries. The queries look the same, but they're actually running over views, which have different definitions. The query that succeeded is a straight select * from table whereas the one that has the JS OOM is using a UDF.
We're in the midst of rolling out a fix for the JS OOM issue, by allowing the JavaScript engine to use more RAM.
...
...and now for some information that's not really relevant to this case, but that might be of future value...
...
In theory, it could be possible for a query to succeed in one project and fail in another, even if they're running over exactly the same dataset. This would be unusual, but possible.
Background : BigQuery operates and maintains copies of customer data in multiple datacentres for redundancy. Different projects are biased to run in different datacentres to help with load spreading and utilisation.
A query will run in the default datacentre for its project if the data is fresh enough. We have a process that replicates the data between datacentres, and we avoid running in a datacentre that has a stale copy of the data. However, we run maintenance jobs to ensure that the files that comprise your data are of "optimal" size. These jobs are scheduled separately per datacentre, so it's possible that your underlying data files for the same exact table would have a different physical structure in cell A and cell B. It would be possible for this to affect aspects of a query's performance, and in extreme cases, a query may succeed in cell A but not B.

Where should we calculate fields?

I'm currently working in a Silverlight / MS SQL project where the Entity Framework has not been implemented and I would like to know what's the best practice to deal with calculated fields in this particular situation.
Considering that some external system might also consume my data directly in the DB or through a web service, here are the 3 options I can see right now.
1) Force any external system to consume data thru a web service and create all the calculated fields in the objects only.
2) Create the calculated fields in a DB view and resync your object with the server each time a value needs to be calculated.
3) Replicate the calculation rules in the object and the database view.
Any other suggestions would also be welcomed.
I would recommend following two principles: data decoupling and minimum duplication of functionality. Both suggest putting your calculations in one place only and serving them already calculated. So I would implement the calculations in the DB and serve them via a web service.
However, you have to consider your particular case. For example, if the calculations are VERY heavy, you could delegate them to the client to spare server resources. This could even be the reason you are using Silverlight. I am in a similar situation on a project, and I found that the best compromise is to push raw data to the client and have it do the heavy computations.
Having a best practice or approach for this kind of problem is difficult: as circumstances change, what was formerly a good approach might start to seem less useful. That said, where possible I would do anything data-related at the DB level, including calculated fields. This way, no matter where you are looking at the data from, you will see the same results. So your web service, SQL reporting, and anything else that needs to look at or receive data will see the same result.